Diffuse Prior Monotonic Likelihood Ratio Test for
Evaluation  of  Fused  Image  Quality  Measures 

Chuanming  Wei,  Lance  M.  Kaplan,  Senior  Member,  IEEE ,  Stephen  D.  Burks,  and  Rick  S.  Blum,  Fellow,  IEEE 


Abstract — This  paper  introduces  a  novel  method  to  score  how 
well  proposed  fused  image  quality  measures  (FIQMs)  indicate  the 
effectiveness  of  humans  to  detect  targets  in  fused  imagery.  The 
human  detection  performance  is  measured  via  human  perception 
experiments.  A  good  FIQM  should  relate  to  perception  results  in 
a monotonic fashion. The method computes a new diffuse prior
monotonic likelihood ratio (DPMLR) to facilitate the comparison
of the $H_1$ hypothesis that the intrinsic human detection performance
is related to the FIQM via a monotonic function against
the null hypothesis that the detection and image quality relationship
is random. The paper discusses many interesting properties
of the DPMLR and demonstrates the effectiveness of the DPMLR
test via Monte Carlo simulations. Finally, the DPMLR is used to
score FIQMs with test cases considering over 35 scenes and various
image fusion algorithms.

Index  Terms — Fused  image  quality  measures  (FIQM),  hypoth¬ 
esis  test,  image  fusion,  monotonic  correlation  (MC). 


I.  Introduction 

IN RECENT years, image fusion has attracted a large
amount of attention in a wide variety of applications such
as concealed weapon detection [1], remote sensing [2], intelligent
robots [3], medical diagnosis [4], and military surveillance [5].
Image fusion refers to generating a combined image
in which each pixel is determined from a set of pixels in each
of the source images. The fused image should provide an easier
view for a human to interpret the scene than any of the source
images, thus improving the performance of the human in
accomplishing his/her task. The interested reader is referred to [6,
Ch. 1] for a survey of various image fusion algorithms developed
in past years.

Measuring the performance of image fusion algorithms is
an extremely important task that has been studied previously
[7]-[22]. The performance of image fusion algorithms is
primarily assessed by perceptual evaluation in the form of
subjective human tests [13]. Typically in these tests, human
observers are asked to view a series of fused images and rate them.
Because images are fused for better human interpretation, it is
more important to judge fusion methods by how well humans
are able to perform interpretation tasks. Examples of human
interpretation studies for image fusion evaluations appear in
[17], [22]. No matter the goal of the human perception test,
these tests are inconvenient, expensive, and time consuming.

It is highly desirable to identify an objective performance
measure that can accurately predict human perception by
determining the quality of the fused image. The objective measure
should be a feature that is obtained via an automatic computation
employing the fused image and can serve as a surrogate
for human perception results. We refer to such a feature as the
fused image quality measure (FIQM). If a good FIQM can be
devised, then one can compare image fusion algorithms without
expensive perception experiments. Furthermore, the measure
can be used as a design criterion for an “optimal” image fusion
algorithm.

In the literature, three broad classes of FIQMs have been proposed.
The first class represents full-reference measures. They
require a reference fused image (or the ground truth image) that
represents the “ideal” image of the scene. Once the ground truth
image is given, one can use existing quality metrics such as the
mean square error, the peak signal-to-noise ratio, or more sophisticated
measures such as structure similarity [23] to compare the
fused images with the reference. In the image compression application,
the uncompressed image represents the ideal, and it
has been demonstrated that the structure similarity is a meaningful
full-reference measure [23]. For the image fusion application,
it is only possible to generate a reference image for some
special cases (for instance, multifocus image fusion [8]). In
most cases, one has to resort to other classes of FIQMs that do
not require a reference image. The second class of FIQMs represents
source comparative measures that utilize partial information
about the scene, e.g., the source images that were collected
and utilized as input to the image fusion process. This
class of FIQMs has recently received a great deal of attention
[9]-[12]. These measures quantify the amount of information
transferred from the source images to the fused image by considering
the sum of correlations between each source image and
the fused image. An analysis of this class of FIQMs is provided
in [14]. The third class of FIQMs represents no-source comparative
measures. These measures attempt to extract the salient features,
such as the structure, texture, contrast, and edge information,
directly from the fused image without regard to the source
images [17]-[21].

Quantitatively evaluating the image fusion performance is
a complicated issue because of the lack of a complete understanding
of the human visual system (HVS), and because of the
variety of image fusion applications [15]. We expect that the
FIQM should be task specific, and the best measure changes
from task to task. Given an image fusion application and many
kinds of proposed FIQMs, we are interested in which quality
measure better describes the performance of the human interpreting
the fused imagery.

Ideally, the FIQM for a given image would reveal how well a
human can interpret the image for a given task, i.e., it can predict
human performance. One can achieve this aim by inventing
a measure that linearly fits the human perception performance.
In [24], the authors have shown evidence of an approximately
linear fit between image quality (IQ) measures and
the subjective rating of image distortions. However, an image is
a projection of a particular scene, and the context in the scene,
i.e., the relationship of the objects in the scene, can affect the
ability of a human to perform a particular task (target detection,
for example). Since linearity is a stricter requirement than
monotonicity for an FIQM and is harder to achieve under various
contexts, we believe that it will be more difficult to guarantee
linearity when the IQ is used to predict the ability of a human
to interpret the image for a given task. Thus, we focus on the
monotonicity criterion in this paper.

By monotonicity we mean that a realistic FIQM can determine
the relative ranking of human performance over a series
of fused images derived from the exact same source images,
which we now refer to as a scene. For a given scene, as the FIQM
increases over a series of fused images, human performance over
these images should also increase. If the human performance is
consistently decreasing, the measure is still good as it can be
trivially transformed into a proper FIQM via a reciprocal operation.
Thus, a potential FIQM should be judged by how well a
monotonic function (ascending or descending) explains the relation
between the FIQM and human performance over a variety
of fused imagery representing the same scene. In addition, the
nature of the monotonic relationship (ascending or descending)
should be consistent from scene to scene. Overall, a statistic that
quantifies how well different FIQMs are consistent with actual
human performance is necessary.

This paper focuses on scoring FIQMs for the case of the detection
task. Performance is measured by the probability that a
human observer can correctly detect certain objects in the fused
image. The human perception experiments measure the number
of observers that are able to correctly detect ground-truthed targets
as the human performance. This performance metric can be
reasonably modeled by a binomial distribution. This paper introduces
a new monotonic statistic for the object detection task that
is applicable when the underlying perception results are derived
from a small number of human observers. To handle a small
number of observers, this statistic does not make Gaussian assumptions
about the performance measurements.

Previous work does exist to objectively score the effectiveness
of FIQMs. In [16], Pearson (or linear) correlation and root
mean squared error (RMSE) are used to score potential FIQMs.
The Pearson correlation quantifies how well a straight line fits
the mapping between the input and output sequences. Unfortunately,
when the relationship between the quality measure and
the human performance is nonlinear, the value of Pearson correlation
can be small despite the fact that the sequences are still
monotonically related. In essence, a proper statistic needs to determine
if the ordering of a quality measure preserves the ordering
of the corresponding human performance measures.
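
The gap between linear fit and order preservation can be illustrated with a small sketch. The FIQM values and the logistic detection curve below are hypothetical and chosen only for illustration; the helper names `pearson` and `preserves_order` are not from the paper.

```python
import math

def pearson(a, b):
    """Pearson (linear) correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

def preserves_order(x, p):
    """True if arranging the pairs by ascending x leaves p non-decreasing."""
    p_sorted = [v for _, v in sorted(zip(x, p))]
    return all(u <= v for u, v in zip(p_sorted, p_sorted[1:]))

# Hypothetical FIQM values and a steep logistic detection curve (illustrative only).
x = [0.5 * k for k in range(21)]
p = [1.0 / (1.0 + math.exp(-4.0 * (xi - 8.0))) for xi in x]

r = pearson(x, p)
print(f"Pearson r = {r:.3f}; ordering preserved: {preserves_order(x, p)}")
```

In this example the detection probability is an increasing function of the measure, so the ordering is preserved perfectly, yet the Pearson coefficient falls well short of one because a straight line fits the curve poorly.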

The Spearman and Kendall correlations [25], [26] are
common statistics to quantify how well the output sequence
is ordered. In fact, the Spearman correlation has been used
to evaluate the quality measures for video streams [27]. Both
quantities are invariant to monotonic transformations of both
the input and output sequences because monotonic transformations
preserve the rank order of the sequences. For evaluation
of the utility of FIQMs, a misordering of human performance
values that are nearly identical should not lower the correlation
value too much. Because only ranks and not actual values
are considered, the reduction in correlation score due to these
misorderings can be identical to or even greater than that of
misorderings of widely varying human performance values.
Clearly, measurement noise can greatly impact the correlation
scores. Therefore, these rank-order correlations are not appropriate
for seeking out good FIQMs.
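
A minimal sketch of this insensitivity follows, using the textbook Spearman formula for tie-free data. The performance values are made up for illustration and `ranks` and `spearman` are hypothetical helper names.

```python
def ranks(values):
    """Ranks 1..n of the values (assumes no ties, for simplicity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(a, b):
    """Spearman rank correlation for tie-free data: 1 - 6*sum(d^2)/(n(n^2-1))."""
    n = len(a)
    d2 = sum((ra - rb) ** 2 for ra, rb in zip(ranks(a), ranks(b)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

fiqm = [1.0, 2.0, 3.0, 4.0]            # hypothetical measure values
near_miss = [0.10, 0.51, 0.50, 0.90]   # misordering of nearly equal performances
wide_miss = [0.10, 0.80, 0.20, 0.90]   # misordering of very different performances

rho_near = spearman(fiqm, near_miss)
rho_wide = spearman(fiqm, wide_miss)
print(rho_near, rho_wide)
```

Both sequences have the same rank pattern, so the two scores coincide even though the first misordering could easily be caused by measurement noise while the second could not.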

In [23], [27], a nonlinear regression fit to a logistic function
followed by linear correlation is used to compare various FIQMs
in order to accommodate the nonlinear, but monotonic, relationships.
Recently, the monotonic correlation (MC), which uses
isotonic regression followed by linear correlation, has been proposed
in [17]. As demonstrated in [17], the MC is more flexible
than linear correlation or the logistic analysis in [23], [27]. Like
linear and logistic correlation, the MC assumes that the perception
error is Gaussian, which is inappropriate for the detection
task when the number of observers is small.

To our knowledge, this paper represents the first attempt to
score the effectiveness of FIQMs for the detection task in light
of practical measurements from human perception experiments.
To this end, the paper develops a novel statistic to test whether or
not a monotonic relationship exists between the proposed FIQM
and the human performance. The monotonic statistic is general
and can be applied to other applications when one may need to
test for a monotonic relationship. A preliminary version of this
work has appeared in [28].

The paper is organized as follows. Section II presents the
perception model and introduces the new monotonic statistic.
Section III demonstrates the effectiveness of the new statistic
via Monte Carlo simulations. The statistic is used to score potential
FIQMs against actual perception results for fused images
in Section IV. Finally, Section V provides some concluding
remarks.

II.  Statistical  Monotonic  Analysis 

The paper focuses on the detection task and measures the performance
of image fusion algorithms by the probability that a
human observer can correctly detect certain objects in the fused
image. This section develops the test statistic that compares the
hypothesis that the relationship between human detection performance
and FIQM values is monotonic to the hypothesis that
the relationship is random. The statistic is based upon the model
that each image exhibits a ground truth quality score, which
is the probability that any human can detect the object in it.
Section II-A derives the likelihoods for each hypothesis conditioned
on these ground truth quality scores. Then, Section II-B
uses an uninformative prior for the ground truth quality scores to
define the likelihood ratio so that it is computationally feasible,
as demonstrated in Section II-C. Finally, Section II-D presents
properties of the test statistic.


A.  Data  Models 

A scene is a realization of $F$ source images, and we consider
$N$ different fusion algorithms. The existence (or lack) of a
monotonic relationship between measured human performance
and computed FIQMs can be inferred over $S$ scenes. To this
end, this subsection provides the data models that enable this
inference.

For a given scene, let the $N \times 1$ vector
$\tilde{\mathbf{p}} = (\tilde{p}_1, \tilde{p}_2, \ldots, \tilde{p}_N)^T$
denote the actual performance for all fusion methods, where $\tilde{p}_i$
is the object detection probability, i.e., the ground truth quality
score, associated with the image obtained from the $i$th fusion
algorithm. Let a given FIQM evaluated over $N$ fusion algorithms
be denoted as an $N \times 1$ vector
$\tilde{\mathbf{x}} = (\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_N)^T$.
The computed value $\tilde{x}_i$ is a deterministic function of the image
obtained from the $i$th fusion algorithm and the $F$ source images. The
proposed monotonic hypothesis test evaluates how well an FIQM
monotonically relates to human object detection performance.
Under the monotonic hypothesis, there is a monotonic function
that maps the measure value $\tilde{x}_i$ associated with the $i$th fusion
algorithm to the detection probability $\tilde{p}_i$, i.e.,

$$\tilde{p}_i = g(\tilde{x}_i) \tag{1}$$


where $g(x)$ is a monotonically increasing or decreasing function
of $x$. Let $\mathbf{p}$ and $\mathbf{x}$ denote a reordering of
$\tilde{\mathbf{p}}$ and $\tilde{\mathbf{x}}$ such that
the measure values are in ascending order, i.e.,
$x_1 \le x_2 \le \cdots \le x_N$. Note that
$\mathbf{p} = P_k \tilde{\mathbf{p}}$ and $\mathbf{x} = P_k \tilde{\mathbf{x}}$,
where $P_k$ is one of $N!$ possible permutation matrices. This paper uses the
convention that $P_1$ is the identity matrix and $P_{N!}$ reverses the
original ordering, i.e., it is the anti-diagonal matrix of ones. Now,
we consider two alternative $H_1$ hypotheses: $H_\uparrow$ for ascending
$p_i$'s and $H_\downarrow$ for descending $p_i$'s. On the other hand, the null
hypothesis is that over the ensemble of possible fused imagery,
the $\tilde{x}_i$'s are i.i.d. samples. Thus, the $p_i$'s are in random order,
where every permutation of the order is equally probable.
In other words, $P_k$ is the permutation matrix that orders the $\tilde{p}_i$'s
for the $H_1$ hypotheses, and $P_k$ is randomly chosen via a uniform
distribution over the $N!$ possible permutation matrices under the
null ($H_0$) hypothesis. Namely, the conditional probability mass
functions (pmfs) of the permutations conditioned on $\tilde{\mathbf{p}}$ and the
hypotheses for $k = 1, \ldots, N!$ are

$$
\begin{aligned}
f_\pi(P_k \mid \tilde{\mathbf{p}}, H_\uparrow) &=
  \begin{cases} 1, & \text{if } P_k \tilde{\mathbf{p}} \in \mathcal{P}_\uparrow \\ 0, & \text{otherwise,} \end{cases} \\
f_\pi(P_k \mid \tilde{\mathbf{p}}, H_\downarrow) &=
  \begin{cases} 1, & \text{if } P_k \tilde{\mathbf{p}} \in \mathcal{P}_\downarrow \\ 0, & \text{otherwise,} \end{cases} \\
f_\pi(P_k \mid \tilde{\mathbf{p}}, H_0) &= \frac{1}{N!}
\end{aligned}
\tag{2}
$$


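where

$$
\begin{aligned}
\mathcal{P}_\uparrow &= \{\mathbf{p} : 0 < p_1 < \cdots < p_N < 1\} \\
\mathcal{P}_\downarrow &= \{\mathbf{p} : 1 > p_1 > \cdots > p_N > 0\}.
\end{aligned}
\tag{3}
$$

For this discussion, it is also convenient to define $\mathcal{P}_0$ as the set of
all possible $\mathbf{p}$'s, i.e.,

$$\mathcal{P}_0 = \{\mathbf{p} : 0 < p_1, \ldots, p_N < 1\}. \tag{4}$$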

If $\mathbf{p} = P_k \tilde{\mathbf{p}}$ is observed, then the likelihoods of the
hypotheses, i.e., $l(H_i \mid \mathbf{p}) = f_\pi(P_k \mid \tilde{\mathbf{p}}, H_i)$
for $i \in \{\uparrow, \downarrow, 0\}$,
demonstrate that if $\mathbf{p}$ is not in ascending (or descending) order,
then the ascending (or descending) likelihood (and likelihood
ratio) is zero, and the $H_\uparrow$ (or $H_\downarrow$) hypothesis must be incorrect.
Also, if $\mathbf{p}$ happens to be in ascending (or descending) order,
then either the $H_\uparrow$ (or $H_\downarrow$) hypothesis is true or the ordering of
$\mathbf{p}$ is due to random luck under the null hypothesis, which occurs
with a probability of $1/N!$. Thus, for $\mathbf{p} \in \mathcal{P}_\uparrow$
(or $\mathbf{p} \in \mathcal{P}_\downarrow$), the
likelihood ratio is not infinite, which would indicate a sure monotonic
relationship. Rather, it is $N!$ because the random $\mathbf{x}$ can order
$\mathbf{p}$ by chance.

Unfortunately, the value of $\mathbf{p}$ (or $\tilde{\mathbf{p}}$) is unobservable. It can
only be inferred via perception experiments that measure
$\mathbf{y} = (y_1, y_2, \ldots, y_N)^T$, where $y_i$ is the number of observers that
correctly detect the targets in the image obtained from the $i$th fusion
algorithm.$^1$ We use $o_i$ to represent the number of observers that
participate in the detection experiment for the image formed by
the $i$th fusion algorithm. Under the assumption that all humans are
equally capable, it is reasonable to model $\mathbf{y}$ as a random vector
whose elements are statistically independent, where $y_i$ is drawn
from a binomial distribution with parameters $o_i$ and $p_i$ so that
the pmf of $\mathbf{y}$ conditioned on $\mathbf{o}$ and $\mathbf{p}$ is

$$
\mathbf{y} \sim f_y(\mathbf{y} \mid \mathbf{o}, \mathbf{p})
= \prod_{i=1}^{N} \binom{o_i}{y_i} p_i^{y_i} (1 - p_i)^{o_i - y_i}. \tag{5}
$$

Here we represent the $o_i$'s in an $N \times 1$ vector $\mathbf{o}$ for notational
convenience. Since $\mathbf{p} = P_k \tilde{\mathbf{p}}$, one can define
$f_y(\mathbf{y} \mid \mathbf{o}, P_k, \tilde{\mathbf{p}}) = f_y(\mathbf{y} \mid \mathbf{o}, \mathbf{p})$.
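
The observation model in (5) can be sketched in a few lines. The observer counts, detection probabilities, and function names below are illustrative assumptions rather than values or code from the perception experiments.

```python
import math
import random

def log_pmf(y, o, p):
    """Log of the product-of-binomials pmf in (5) for count vector y, observer
    counts o and detection probabilities p (each strictly between 0 and 1)."""
    total = 0.0
    for yi, oi, pi in zip(y, o, p):
        total += math.log(math.comb(oi, yi))
        total += yi * math.log(pi) + (oi - yi) * math.log(1.0 - pi)
    return total

def simulate_counts(o, p, rng):
    """Draw y_i ~ Binomial(o_i, p_i) independently by summing Bernoulli trials."""
    return [sum(rng.random() < pi for _ in range(oi)) for oi, pi in zip(o, p)]

rng = random.Random(0)        # fixed seed, purely for repeatability
o = [8, 8, 8, 8]              # hypothetical: 8 observers per fused image
p = [0.2, 0.4, 0.6, 0.9]      # hypothetical ascending detection probabilities
y = simulate_counts(o, p, rng)
print(y, math.exp(log_pmf(y, o, p)))
```

With only a handful of observers per image, the simulated counts need not be ascending even though the underlying probabilities are, which is the measurement noise the test statistic has to account for.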

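The joint pmf of the observations $\mathbf{y}$ and the permutations $P_k$
can be written as

$$
f_{y\pi}(\mathbf{y}, P_k \mid \mathbf{o}, \tilde{\mathbf{p}}, H_i)
= f(\mathbf{y} \mid P_k, \mathbf{o}, \tilde{\mathbf{p}}, H_i)\,
  f(P_k \mid \mathbf{o}, \tilde{\mathbf{p}}, H_i). \tag{6}
$$

Because $\mathbf{y}$ conditioned on $\mathbf{o}$ and $\mathbf{p}$ is independent of $H_i$,
$f(\mathbf{y} \mid P_k, \mathbf{o}, \tilde{\mathbf{p}}, H_i) = f_y(\mathbf{y} \mid \mathbf{o}, P_k, \tilde{\mathbf{p}})$
for all $H_i$'s. Furthermore,
$f(P_k \mid \mathbf{o}, \tilde{\mathbf{p}}, H_i) = f_\pi(P_k \mid \tilde{\mathbf{p}}, H_i)$
because $P_k$ does not depend upon $\mathbf{o}$. Thus,
$f_{y\pi}(\mathbf{y}, P_k \mid \mathbf{o}, \tilde{\mathbf{p}}, H_i)$ is obtained by the
multiplication of (2) and (5) so that

$$
\begin{aligned}
f_{y\pi}(\mathbf{y}, P_k \mid \mathbf{o}, \tilde{\mathbf{p}}, H_\uparrow) &=
  \begin{cases} f_y(\mathbf{y} \mid \mathbf{o}, P_k, \tilde{\mathbf{p}}), & \text{if } P_k \tilde{\mathbf{p}} \in \mathcal{P}_\uparrow \\ 0, & \text{otherwise} \end{cases} \\
f_{y\pi}(\mathbf{y}, P_k \mid \mathbf{o}, \tilde{\mathbf{p}}, H_\downarrow) &=
  \begin{cases} f_y(\mathbf{y} \mid \mathbf{o}, P_k, \tilde{\mathbf{p}}), & \text{if } P_k \tilde{\mathbf{p}} \in \mathcal{P}_\downarrow \\ 0, & \text{otherwise} \end{cases} \\
f_{y\pi}(\mathbf{y}, P_k \mid \mathbf{o}, \tilde{\mathbf{p}}, H_0) &=
  \frac{1}{N!} f_y(\mathbf{y} \mid \mathbf{o}, P_k, \tilde{\mathbf{p}}).
\end{aligned}
\tag{7}
$$

$^1$For variables that do not use the tilde, the indices for the images are such
that the $x_i$'s are in ascending order.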

Then, a hypothesis test to distinguish $H_\uparrow$ or $H_\downarrow$ from $H_0$
using the observed values can be derived from the likelihoods
$l(H_i \mid \mathbf{y}, \mathbf{o}, P_k, \tilde{\mathbf{p}}) = f_{y\pi}(\mathbf{y}, P_k \mid \mathbf{o}, \tilde{\mathbf{p}}, H_i)$.
Because $\mathbf{p} = P_k \tilde{\mathbf{p}}$
is not observed, the hypothesis test is a composite test. It is
unclear whether a uniformly most powerful (UMP) test exists.
A common test to apply is the generalized likelihood ratio test
(GLRT). This requires one to compute the maximum likelihood
(ML) estimates $\hat{\mathbf{p}}_\uparrow$, $\hat{\mathbf{p}}_\downarrow$, and
$\hat{\mathbf{p}}_0$ for the $H_\uparrow$, $H_\downarrow$, and $H_0$ hypotheses,
respectively. For the two $H_1$ hypotheses, the ML estimates can
be obtained by the $O(N)$ pool adjacent violators algorithm
[17], [29], [30]. For the null hypothesis, $\hat{p}_i = y_i / o_i$. The
GLRT has the property that for any ascending (or descending)
$\mathbf{y}$, the ascending (or descending) generalized likelihood ratio
(GLR) is $N!$. However, if the $y_i$'s are close in value, the
ordering is more likely to be due to luck than when the $y_i$'s
are well spread, and the GLRT is unable to make this
distinction between different ordered $\mathbf{y}$'s. A different approach
that accounts for the relative spread of the observation values
is needed.
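
This limitation of the GLRT can be reproduced with a short sketch. It assumes the standard result that the order-restricted binomial ML estimate is the isotonic regression of the rates $y_i/o_i$ weighted by $o_i$, computed here with a simple pool adjacent violators pass; the function names and the counts are hypothetical.

```python
import math

def pava_binomial(y, o):
    """Ascending order-restricted estimate of binomial probabilities via pool
    adjacent violators: neighbouring blocks that violate the ordering are
    merged and replaced by their pooled success rate."""
    blocks = []  # each block holds [successes, trials, number_of_images]
    for yi, oi in zip(y, o):
        blocks.append([yi, oi, 1])
        # Merge while the previous block's rate exceeds the last block's rate.
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            s, t, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += t
            blocks[-1][2] += c
    p_hat = []
    for s, t, c in blocks:
        p_hat.extend([s / t] * c)
    return p_hat

def log_lik(y, o, p):
    """Binomial log-likelihood without the binomial coefficients,
    which cancel in the likelihood ratio."""
    total = 0.0
    for yi, oi, pi in zip(y, o, p):
        if yi > 0:
            total += yi * math.log(pi)
        if oi - yi > 0:
            total += (oi - yi) * math.log(1.0 - pi)
    return total

def glr_ascending(y, o):
    """Ascending GLR: N! times the ratio of the maximized likelihoods."""
    p_up = pava_binomial(y, o)
    p_0 = [yi / oi for yi, oi in zip(y, o)]
    return math.factorial(len(y)) * math.exp(log_lik(y, o, p_up) - log_lik(y, o, p_0))

o = [10, 10, 10]                        # hypothetical observer counts
tight = glr_ascending([5, 6, 7], o)     # ascending but closely spaced counts
spread = glr_ascending([1, 5, 9], o)    # ascending and well separated counts
mixed = glr_ascending([7, 5, 6], o)     # not ascending
print(tight, spread, mixed)
```

The closely spaced and the well separated ascending counts receive the same score, because in both cases the constrained and unconstrained estimates coincide; only the non-ascending counts score lower.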

B.  Diffuse  Prior  Monotonic  Likelihood  Ratio  Test 

A given scene is a realization from the ensemble of possible
source images. Therefore, it is reasonable to model the detection
probabilities as being drawn from a random distribution, i.e.,
$\tilde{\mathbf{p}} \sim f_p(\tilde{\mathbf{p}})$. The diffuse prior monotonic likelihood ratio test
(DPMLRT) assumes that for a given scene, $\tilde{\mathbf{p}}$ is a realization of
an uninformative (or diffuse) prior distribution, i.e., the elements
$\tilde{p}_i$ are i.i.d. uniform on $[0, 1)$ so that $f_p(\tilde{\mathbf{p}}) = 1$. The uniform
distribution models the fact that the imagery is collected in various
conditions where the ability to detect the objects can be easy,
hard, or somewhere in between. The independence between fusion
methods is a simplifying assumption that leads to a computationally
feasible test. Because the prior on $\tilde{\mathbf{p}}$ is independent
of the hypothesis $H_i$ and $\mathbf{o}$, we have
$f(\tilde{\mathbf{p}} \mid \mathbf{o}, H_i) = f_p(\tilde{\mathbf{p}}) = 1$.
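Then, $\tilde{\mathbf{p}}$ is marginalized so that the expected likelihood for the
$i$th hypothesis is

$$
\begin{aligned}
l(H_i \mid \mathbf{y}, \mathbf{o}, P_k)
&= \int_{\mathcal{P}_0} f_{y\pi}(\mathbf{y}, P_k \mid \mathbf{o}, \tilde{\mathbf{p}}, H_i)\,
   f(\tilde{\mathbf{p}} \mid \mathbf{o}, H_i)\, d\tilde{\mathbf{p}} \\
&= \int_{\mathcal{P}_0} f_{y\pi}(\mathbf{y}, P_k \mid \mathbf{o}, \tilde{\mathbf{p}}, H_i)\, d\tilde{\mathbf{p}}.
\end{aligned}
\tag{8}
$$

Now the expected likelihoods do not depend upon any unobservable
parameters. The integral in (8) can be simplified by
noting that the integrand is given by (7) and using the change
of variable $\tilde{\mathbf{p}} \mapsto P_k^{-1} \mathbf{p}$. Then, it is easy to see that

$$
\begin{aligned}
l(H_\uparrow \mid \mathbf{y}, \mathbf{o}, P_k) &= \int_{\mathcal{P}_\uparrow} f_y(\mathbf{y} \mid \mathbf{o}, \mathbf{p})\, d\mathbf{p} \\
l(H_\downarrow \mid \mathbf{y}, \mathbf{o}, P_k) &= \int_{\mathcal{P}_\downarrow} f_y(\mathbf{y} \mid \mathbf{o}, \mathbf{p})\, d\mathbf{p} \\
l(H_0 \mid \mathbf{y}, \mathbf{o}, P_k) &= \frac{1}{N!} \int_{\mathcal{P}_0} f_y(\mathbf{y} \mid \mathbf{o}, \mathbf{p})\, d\mathbf{p}.
\end{aligned}
\tag{9}
$$

Now, the tests to distinguish the $H_1$ hypotheses from the null
hypothesis are simple hypothesis tests, and the likelihood ratio
test (LRT) is the most powerful test. Namely, given that for each
scene $\tilde{\mathbf{p}}$ is drawn from the uninformative prior, the following
LRTs are optimal in the Neyman-Pearson sense [31] for
distinguishing the monotonically ascending or descending hypothesis
from the null hypothesis:$^2$

$$
\begin{aligned}
\Lambda_N^\uparrow(\mathbf{y}, \mathbf{o}) &=
  \frac{N! \int_{\mathcal{P}_\uparrow} f_y(\mathbf{y} \mid \mathbf{o}, \mathbf{p})\, d\mathbf{p}}
       {\int_{\mathcal{P}_0} f_y(\mathbf{y} \mid \mathbf{o}, \mathbf{p})\, d\mathbf{p}} \\
\Lambda_N^\downarrow(\mathbf{y}, \mathbf{o}) &=
  \frac{N! \int_{\mathcal{P}_\downarrow} f_y(\mathbf{y} \mid \mathbf{o}, \mathbf{p})\, d\mathbf{p}}
       {\int_{\mathcal{P}_0} f_y(\mathbf{y} \mid \mathbf{o}, \mathbf{p})\, d\mathbf{p}}.
\end{aligned}
\tag{10}
$$

We refer to $\Lambda_N^\uparrow$ and $\Lambda_N^\downarrow$ as the ascending and descending diffuse
prior monotonic likelihood ratio (DPMLR), respectively.

For multiple scenes, the nature of the monotonicity (ascending
or descending) should be consistent from scene to
scene. Therefore, one must consider the cumulative likelihoods
for the ascending, descending, and null hypotheses. Since we
assume that the $\mathbf{y}$'s and $\mathbf{p}$'s are statistically independent from
scene to scene, the likelihoods for each hypothesis accumulate
via the product operation. The cumulative likelihood ratios are
then proportional to the geometric mean of the likelihood ratios
for each scene. The geometric mean provides a convenient
way to normalize the score against the number of scenes. The
overall likelihood ratio for the monotonic relationship over $S$
scenes is formally defined as

$$
\Lambda_N = \max\left\{ \prod_{s=1}^{S} \Lambda_N^\uparrow(\mathbf{y}_s, \mathbf{o}_s),\;
\prod_{s=1}^{S} \Lambda_N^\downarrow(\mathbf{y}_s, \mathbf{o}_s) \right\}^{1/S}
\tag{11}
$$

where $\mathbf{y}_s$ and $\mathbf{o}_s$ are the number of correct detections and
observers for the $s$th scene, respectively. Note that $\Lambda_N$ is agnostic
to the nature of the monotonicity. Unless it is required, the scene
index is implicit for the sake of notational brevity. We refer to
$\Lambda_N$ as the composite DPMLR. When $\Lambda_N > 1$, the evidence in
support of the monotonic hypothesis is greater than that of the
null hypothesis, where the FIQM behaves as noise with respect to
human performance. As $\Lambda_N$ increases, so does the evidence that
the FIQM under test is actually a good measure. The DPMLRT
simply accepts the monotonic hypothesis if the DPMLR exceeds
a given threshold value. Usually, the threshold is greater
than one.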

C.  Recursive  Computation 

To our knowledge, a closed form expression for (10) does not
exist. It is possible to calculate the diffuse likelihood ratios
numerically; however, because of the multivariable integration
involved in the expression, direct numerical integration requires
a large computational cost and quickly becomes infeasible,
especially when $N$ and the $o_i$'s are
large. This subsection provides a recursion to calculate these
diffuse likelihood ratios.

The diffuse likelihood for $H_0$, apart from the constant factor $1/N!$
in (9), can be simply expressed as

$$
l(H_0 \mid \mathbf{y}, \mathbf{o}) = \prod_{i=1}^{N} \binom{o_i}{y_i}\, \beta(y_i + 1, o_i - y_i + 1)
\tag{12}
$$

where

$$
\beta(a, b) = \int_0^1 z^{a-1} (1 - z)^{b-1}\, dz
\tag{13}
$$

is the Beta function.

$^2$For notational convenience, the dependency of $\Lambda$ on the ordering $P_k$ is left
implicit since $\Lambda$ is actually invariant to $P_k$ except in how it orders $\mathbf{y}$.
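
For integer arguments the Beta function reduces to a ratio of factorials, and each factor in (12) collapses to $1/(o_i + 1)$. The exact-arithmetic sketch below checks this; the helper names and the counts are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial

def beta_int(a, b):
    """Beta function for positive integers, (a-1)!(b-1)!/(a+b-1)!, as an exact fraction."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))

def null_likelihood(y, o):
    """Right-hand side of (12): product of binomial coefficients times Beta functions."""
    out = Fraction(1)
    for yi, oi in zip(y, o):
        out *= comb(oi, yi) * beta_int(yi + 1, oi - yi + 1)
    return out

y, o = [2, 5, 7], [8, 8, 10]     # hypothetical counts and observer numbers
print(null_likelihood(y, o))     # prints 1/891, i.e. 1/(9 * 9 * 11)
```

The result does not depend on $\mathbf{y}$: under the diffuse prior, every count from $0$ to $o_i$ is equally likely before the ordering is taken into account.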

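Substituting (5), (8), and (12) into (10), the ascending
diffuse likelihood ratio can be expressed as

$$
\Lambda_N^\uparrow(\mathbf{y}, \mathbf{o}) =
\frac{N! \int_{\mathcal{P}_\uparrow} h(p_N; y_N, o_N) \cdots h(p_1; y_1, o_1)\, d\mathbf{p}}
     {\prod_{i=1}^{N} \beta(y_i + 1, o_i - y_i + 1)}
\tag{14}
$$

where

$$
h(p; y, o) = p^{y} (1 - p)^{o - y}.
\tag{15}
$$

By considering the power series expansion of the regularized
incomplete Beta function, the calculation of $\Lambda_N^\uparrow(\mathbf{y}, \mathbf{o})$ can be
simplified in a recursive way. Specifically, the regularized
incomplete Beta function is defined as

$$
I(y; a, b) = \frac{\int_0^{y} z^{a-1} (1 - z)^{b-1}\, dz}{\beta(a, b)}
\tag{16}
$$

and the power series expansion for $I(y; a, b)$, for positive integers
$a$ and $b$, is

$$
I(y; a, b) = \frac{1}{a + b} \sum_{j=a}^{a+b-1}
\frac{y^{j} (1 - y)^{a+b-1-j}}{\beta(j + 1, a + b - j)}.
\tag{17}
$$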
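
The finite expansion can be checked exactly for small integer arguments. The sketch below compares it with a direct term-by-term integration of the definition in (16); it reads the expansion as a sum over $j = a, \ldots, a+b-1$ of $y^j (1-y)^{a+b-1-j} / \beta(j+1, a+b-j)$ divided by $a+b$, and all function names are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial

def beta_int(a, b):
    """Beta function for positive integers, as an exact fraction."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))

def reg_inc_beta_exact(x, a, b):
    """I(x; a, b) from the definition: expand (1 - z)^(b-1) with the binomial
    theorem and integrate z^(a-1+k) term by term from 0 to x."""
    x = Fraction(x)
    integral = sum(comb(b - 1, k) * (-1) ** k * x ** (a + k) / (a + k) for k in range(b))
    return integral / beta_int(a, b)

def reg_inc_beta_series(x, a, b):
    """I(x; a, b) from the finite series over j = a, ..., a + b - 1."""
    x = Fraction(x)
    n = a + b
    terms = (x ** j * (1 - x) ** (n - 1 - j) / beta_int(j + 1, n - j) for j in range(a, n))
    return sum(terms) / n

x = Fraction(1, 3)
print(reg_inc_beta_exact(x, 3, 4), reg_inc_beta_series(x, 3, 4))
```

Because $1/((a+b)\,\beta(j+1, a+b-j))$ equals the binomial coefficient $\binom{a+b-1}{j}$, the series is the familiar binomial tail sum, which is why it terminates after $b$ terms.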


Then, (14) can be written as

$$
\begin{aligned}
\Lambda_N^\uparrow(\mathbf{y}, \mathbf{o})
&= \frac{N! \int_0^1 \cdots \int_0^{p_3} h(p_N; y_N, o_N) \cdots h(p_2; y_2, o_2)
   \left( \int_0^{p_2} h(p_1; y_1, o_1)\, dp_1 \right) dp_2 \cdots dp_N}
  {\beta(y_1 + 1, o_1 - y_1 + 1) \prod_{i=2}^{N} \beta(y_i + 1, o_i - y_i + 1)} \\
&= \frac{N! \int_0^1 \cdots \int_0^{p_3} h(p_N; y_N, o_N) \cdots h(p_2; y_2, o_2)\,
   I(p_2; y_1 + 1, o_1 - y_1 + 1)\, dp_2 \cdots dp_N}
  {\prod_{i=2}^{N} \beta(y_i + 1, o_i - y_i + 1)}.
\end{aligned}
\tag{18}
$$

Now substituting (17) into (18), we obtain

$$
\begin{aligned}
\Lambda_N^\uparrow(\mathbf{y}, \mathbf{o})
&= \frac{N!}{o_1 + 2} \sum_{j=y_1+1}^{o_1+1}
   \frac{\beta(j + y_2 + 1, o_1 + o_2 + 2 - y_2 - j)}
        {\beta(j + 1, o_1 + 2 - j)\, \beta(y_2 + 1, o_2 - y_2 + 1)} \\
&\qquad \times
   \frac{\int_0^1 \cdots \int_0^{p_3} h(p_N; y_N, o_N) \cdots h(p_2; j + y_2, o_1 + o_2 + 1)\, dp_2 \cdots dp_N}
        {\prod_{i=3}^{N} \beta(y_i + 1, o_i - y_i + 1)\; \beta(j + y_2 + 1, o_1 + o_2 + 2 - y_2 - j)} \\
&= \frac{N}{o_1 + 2} \sum_{j=y_1+1}^{o_1+1}
   \frac{\beta(j + y_2 + 1, o_1 + o_2 + 2 - y_2 - j)}
        {\beta(j + 1, o_1 + 2 - j)\, \beta(y_2 + 1, o_2 - y_2 + 1)} \\
&\qquad \times \Lambda_{N-1}^\uparrow\!\left( [j + y_2, y_3, \ldots, y_N]^T, [o_1 + o_2 + 1, o_3, \ldots, o_N]^T \right).
\end{aligned}
\tag{19}
$$

Also from (3), one can see that $\mathcal{P}_\uparrow$ and $\mathcal{P}_0$ are the same when
$N = 1$. Therefore, by definition, we have

$$
\Lambda_1^\uparrow(y_1, o_1) = 1
\tag{20}
$$

and the ascending diffuse likelihood ratio can be computed
numerically via the recursion defined in (19) and (20). A similar
recursion can compute the descending diffuse likelihood ratio.
Alternatively, one can use the symmetry property (see Property
2 in the next subsection) to derive $\Lambda_N^\downarrow$ from the computation of
$\Lambda_N^\uparrow$.


D. Properties

The diffuse likelihood ratios demonstrate a number of interesting
properties that can easily be proven. Some of these properties
are for the general case where the number of observers can
vary over the different fused images. Other properties are for
the case that the number of observers is constant, i.e., $o_i = o$.
This more specific case that $\mathbf{o} = o\mathbf{1}$ is common for perception
experiments where one would expect the evaluation of the
fused imagery over the same number of observers. In addition
to these provable properties, we have discovered other interesting
attributes for the DPMLR by exhaustively computing the
DPMLRs for all $(o+1)^N$ values of $\mathbf{y}$ for manageable, i.e., small,
values of $o$ and $N$. These attributes make sense based upon the
intuition of how the DPMLRT should behave; we speculate that
these attributes are preserved for larger values of $o$ and $N$, and
we present them as conjectures in this subsection. We hope that
proofs will be discovered in the future so that the conjectures
can become properties.

This section first presents the properties that are valid for general
values of $\mathbf{o}$.

Property 1: $\Lambda_N^\uparrow(\mathbf{y}, \mathbf{o}),\ \Lambda_N^\downarrow(\mathbf{y}, \mathbf{o}),\ \Lambda_N \in (0, N!)$.

The proof of this property can be found in Appendix A. The
property bounds the possible values of the diffuse likelihood
ratios. As the number of fused images $N$ to consider increases, the
upper bound for the likelihood ratios grows fast. For a given
value of $N$ and $\mathbf{o}$, the bounds of zero and $N!$ are loose since the
set of all possible values of $\mathbf{y}$ is finite. However, as demonstrated
later in this subsection, as the number of observers increases,
one can find a $\mathbf{y}$ that corresponds to a likelihood ratio value that
is arbitrarily close to either bound. In other words, as the number
of observers increases and the $y_i$'s have sufficient spread, the
likelihood ratio becomes as if $\mathbf{p}$ is observable (see Section II-A).
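
The recursion in (19) and (20) lends itself to a compact exact implementation. The sketch below is one possible reading of it, with the prefactor $N/(o_1 + 2)$ multiplying the ratio for the remaining $N - 1$ images; rational arithmetic is used to avoid rounding, memoization is added as a convenience, and the function names and counts are hypothetical.

```python
from fractions import Fraction
from functools import lru_cache
from math import factorial

def beta_int(a, b):
    """Beta function for positive integers, as an exact fraction."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))

@lru_cache(maxsize=None)
def _lam_up(y, o):
    n = len(y)
    if n == 1:
        return Fraction(1)                        # base case (20)
    y1, o1, y2, o2 = y[0], o[0], y[1], o[1]
    total = Fraction(0)
    for j in range(y1 + 1, o1 + 2):               # j = y1 + 1, ..., o1 + 1
        weight = beta_int(j + y2 + 1, o1 + o2 + 2 - y2 - j) / (
            beta_int(j + 1, o1 + 2 - j) * beta_int(y2 + 1, o2 - y2 + 1))
        # The first two images are merged into one with j + y2 detections
        # out of o1 + o2 + 1 observers; the remaining images are unchanged.
        total += weight * _lam_up((j + y2,) + y[2:], (o1 + o2 + 1,) + o[2:])
    return Fraction(n, o1 + 2) * total

def dpmlr_up(y, o):
    """Ascending DPMLR for count vector y and observer vector o (exact)."""
    return _lam_up(tuple(y), tuple(o))

def dpmlr_down(y, o):
    """Descending DPMLR, obtained by reversing the image order (Property 2)."""
    return dpmlr_up(list(y)[::-1], list(o)[::-1])

print(float(dpmlr_up([1, 4, 8], [8, 8, 8])), float(dpmlr_down([1, 4, 8], [8, 8, 8])))

# Well spread counts approach the upper bound N! = 6 as observers are added.
for n_obs in (2, 8, 32):
    print(n_obs, float(dpmlr_up([0, n_obs // 2, n_obs], [n_obs] * 3)))
```

For two images the sketch can be verified by hand: with $\mathbf{y} = (0, 1)$ and one observer per image, the double integral in (14) evaluates to $5/3$, which is what the recursion returns.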
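Property 2: $\Lambda_N^\downarrow(\mathbf{y}, \mathbf{o}) = \Lambda_N^\uparrow(P_{N!}\mathbf{y}, P_{N!}\mathbf{o}) = \Lambda_N^\uparrow(\mathbf{o} - \mathbf{y}, \mathbf{o})$.

Proof: The first equality is the result of a simple change
of variables $\mathbf{p} \mapsto P_{N!}\mathbf{p}$ in (14). Likewise, the second equality
is the result of the change of variables $p_i \mapsto 1 - p_i$ for
$i = 1, \ldots, N$ in (14) followed by a reversal of the order of
integration. ■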

This property demonstrates a symmetry between $\Lambda_N^{\uparrow}$ and
$\Lambda_N^{\downarrow}$. The symmetry provides a convenient way to derive the
descending likelihood ratio via the computation of the ascending
likelihood ratio and vice versa.
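
The symmetry can be illustrated numerically. The following is a minimal sketch, not the authors' implementation: it evaluates the ascending ratio with the recursion in (19), assuming $h(p; y, o) = p^{y}(1-p)^{o-y}$ and the base case $\Lambda_1^{\uparrow} = 1$, and it obtains the descending ratio by reversing the order of the fused images. The function names are hypothetical, and exact rational arithmetic is used only to make the equality check exact.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb, factorial


def beta(a: int, b: int) -> Fraction:
    """Beta function for positive integers: (a-1)!(b-1)!/(a+b-1)!."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))


@lru_cache(maxsize=None)
def _ascending(y: tuple, o: tuple) -> Fraction:
    n = len(y)
    if n == 1:  # assumed base case: one image carries no ordering evidence
        return Fraction(1)
    y1, y2, o1, o2 = y[0], y[1], o[0], o[1]
    total = Fraction(0)
    for j in range(y1 + 1, o1 + 2):
        # comb(o1 + 1, j) equals 1 / ((o1 + 2) * beta(j + 1, o1 + 2 - j))
        weight = (comb(o1 + 1, j)
                  * beta(j + y2 + 1, o1 + o2 + 2 - y2 - j)
                  / beta(y2 + 1, o2 - y2 + 1))
        # Merge the first two images into one pseudo-image, as in (19).
        total += weight * _ascending((j + y2,) + y[2:], (o1 + o2 + 1,) + o[2:])
    return n * total


def dpmlr_ascending(y, o) -> Fraction:
    return _ascending(tuple(y), tuple(o))


def dpmlr_descending(y, o) -> Fraction:
    # Reversing the image order plays the role of the permutation P_{N!}.
    return _ascending(tuple(reversed(y)), tuple(reversed(o)))


y, o = [1, 3, 2], [4, 5, 3]          # illustrative counts and observer numbers
complement = [oi - yi for oi, yi in zip(o, y)]
print(dpmlr_ascending(y, o) == dpmlr_descending(complement, o))
```

Only the ascending recursion needs to be coded; the descending ratio follows from it through the symmetry.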

The first two properties are valid for a variable number of
observers per fused image. The final set of properties is specific
to the case in which a constant number of observers $o$ is utilized
for the $N$ fused images, i.e., $\mathbf{o} = o\mathbf{1}$.

Property 3: If $y_1 = y_2 = \cdots = y_N$, then $\Lambda_N^{\uparrow}(\mathbf{y}, o\mathbf{1}) =
\Lambda_N^{\downarrow}(\mathbf{y}, o\mathbf{1}) = 1$.

The  proof  of  this  property  is  given  in  Appendix  B.  The 
property  states  that  when  all  observations  are  equal,  one 
cannot  distinguish  between  the  ascending,  descending,  and 
null  hypotheses  because  all  orderings  of  the  observations  are 
indistinguishable.  Clearly,  when  all  observations  are  the  same, 
it  is  an  ill-posed  problem  to  determine  whether  or  not  the 
FIQMs  are  ordering  the  fused  imagery  in  any  special  manner. 

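Property 4: If the $y_i$'s are in ascending order and they are not
constant, then $\Lambda_N^{\uparrow}(P_{N!}\mathbf{y}, P_{N!}\mathbf{o}) \le \Lambda_N^{\uparrow}(P_k\mathbf{y}, P_k\mathbf{o}) \le \Lambda_N^{\uparrow}(\mathbf{y}, \mathbf{o})$
for $1 \le k \le N!$. Likewise, if the $y_i$'s are in descending
order and they are not constant, then $\Lambda_N^{\downarrow}(P_{N!}\mathbf{y}, P_{N!}\mathbf{o}) \le
\Lambda_N^{\downarrow}(P_k\mathbf{y}, P_k\mathbf{o}) \le \Lambda_N^{\downarrow}(\mathbf{y}, \mathbf{o})$ for $1 \le k \le N!$.

Property 5: If the $y_i$'s are in ascending order and they are
not constant, then $\Lambda_N^{\uparrow}(\mathbf{y}, \mathbf{o}) > 1$ and $\Lambda_N^{\downarrow}(\mathbf{y}, \mathbf{o}) < 1$. Likewise,
if the $y_i$'s are in descending order and they are not constant, then
$\Lambda_N^{\downarrow}(\mathbf{y}, \mathbf{o}) > 1$ and $\Lambda_N^{\uparrow}(\mathbf{y}, \mathbf{o}) < 1$.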

The proof of these two properties is provided in Appendix C.
Property 4 states that if the observations demonstrate a perfect
monotonic ascending relationship with the FIQM, then the ascending
likelihood ratio is larger than that for any other ordering
of the observations. Furthermore, the descending order of observations
demonstrates the lowest ascending likelihood ratio of all
possible orderings. Since it can be shown that the average likelihood
ratio over all possible orderings of the observations is
one, Property 5 is a corollary of Property 4. The property states
that as long as the human performance $\mathbf{y}$ is increasing in concert
with $\mathbf{x}$, the diffuse likelihood ratio will always favor the
ascending $H_1^{\uparrow}$ and disfavor the descending $H_1^{\downarrow}$ hypotheses over
the null hypothesis $H_0$. Similarly, as long as the human performance
$\mathbf{y}$ is decreasing in concert with $\mathbf{x}$, the diffuse likelihood
ratio will always favor the descending $H_1^{\downarrow}$ and disfavor the ascending
$H_1^{\uparrow}$ hypotheses over the null hypothesis $H_0$. Clearly,
these two properties are both intuitively appealing.
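
The ordering claims can be checked on a small case. The sketch below is an illustration under stated assumptions rather than the procedure of Appendix C: it uses the recursion in (19) with $h(p; y, o) = p^{y}(1-p)^{o-y}$ and $\Lambda_1^{\uparrow} = 1$, hypothetical helper names, and one arbitrarily chosen ascending vector with a constant number of observers. It evaluates the ascending ratio for every ordering of that vector.

```python
from fractions import Fraction
from functools import lru_cache
from itertools import permutations
from math import comb, factorial


def beta(a, b):
    """Beta function for positive integer arguments."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))


@lru_cache(maxsize=None)
def ascending(y, o):
    """Ascending likelihood ratio via the recursion in (19); y and o are tuples."""
    n = len(y)
    if n == 1:
        return Fraction(1)
    total = Fraction(0)
    for j in range(y[0] + 1, o[0] + 2):
        weight = (comb(o[0] + 1, j)
                  * beta(j + y[1] + 1, o[0] + o[1] + 2 - y[1] - j)
                  / beta(y[1] + 1, o[1] - y[1] + 1))
        total += weight * ascending((j + y[1],) + y[2:], (o[0] + o[1] + 1,) + o[2:])
    return n * total


y = (0, 1, 3)   # ascending and not constant (illustrative values)
o = (3, 3, 3)   # constant number of observers

values = {perm: ascending(perm, o) for perm in permutations(y)}
best = max(values, key=values.get)     # ordering with the largest ratio
worst = min(values, key=values.get)    # ordering with the smallest ratio
average = sum(values.values()) / len(values)

print(best, float(values[best]))
print(worst, float(values[worst]))
print(average)
```

For this vector the ascending ordering gives the largest ratio, the descending ordering gives the smallest, and the average over the orderings equals one, in line with Properties 4 and 5.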


Conjecture 1: The product $\Lambda_N^{\uparrow}(\mathbf{y}, o\mathbf{1}) \cdot \Lambda_N^{\downarrow}(\mathbf{y}, o\mathbf{1}) \le 1$, where
equality occurs if and only if $\Lambda_N^{\uparrow}(\mathbf{y}, o\mathbf{1}) = \Lambda_N^{\downarrow}(\mathbf{y}, o\mathbf{1}) = 1$.

As stated earlier, this conjecture is the result of searching
through an exhaustive list of $(o+1)^N$ monotonic likelihood ratio
values for manageable values of $o$ and $N$. This conjecture states
that the ascending and descending hypotheses can never both be
favored over the null hypothesis. In other words, $\Lambda_N^{\uparrow} > 1$ implies
$\Lambda_N^{\downarrow} < 1$, and $\Lambda_N^{\downarrow} > 1$ implies $\Lambda_N^{\uparrow} < 1$. However, the converse
is not true. It is possible that for a given $\mathbf{y}$ both $\Lambda_N^{\uparrow}$ and $\Lambda_N^{\downarrow}$
are less than one. As a simple example, consider $\mathbf{y} = [0\ 2\ 0]$
for $o = 2$. Because of the symmetry property, $\Lambda_N^{\uparrow} = \Lambda_N^{\downarrow}$. At
best, a symmetric $\mathbf{y}$ can have a monotonic likelihood ratio of
one when all the $y_i$'s are constant. Otherwise, the symmetric $\mathbf{y}$
is neither ascending nor descending and should not provide evidence
to support $H_1^{\uparrow}$ or $H_1^{\downarrow}$ over $H_0$. For this case, the ascending,
descending, and composite DPMLRs are all 0.2286.
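
A small exhaustive search of the kind described above can be sketched as follows. This is a hedged illustration with hypothetical function names, assuming the recursion in (19), $h(p; y, o) = p^{y}(1-p)^{o-y}$, and $\Lambda_1^{\uparrow} = 1$; it is not the authors' search code and covers only one small setting.

```python
from fractions import Fraction
from functools import lru_cache
from itertools import product
from math import comb, factorial


def beta(a, b):
    """Beta function for positive integer arguments."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))


@lru_cache(maxsize=None)
def ascending(y, o):
    """Ascending likelihood ratio via the recursion in (19); y and o are tuples."""
    n = len(y)
    if n == 1:
        return Fraction(1)
    total = Fraction(0)
    for j in range(y[0] + 1, o[0] + 2):
        weight = (comb(o[0] + 1, j)
                  * beta(j + y[1] + 1, o[0] + o[1] + 2 - y[1] - j)
                  / beta(y[1] + 1, o[1] - y[1] + 1))
        total += weight * ascending((j + y[1],) + y[2:], (o[0] + o[1] + 1,) + o[2:])
    return n * total


def descending(y, o):
    """Descending ratio obtained by reversing the image order."""
    return ascending(y[::-1], o[::-1])


def search(o, n):
    """Scan all (o+1)^n vectors; return the largest product and its equality cases."""
    obs = (o,) * n
    largest, equal_one = Fraction(0), []
    for y in product(range(o + 1), repeat=n):
        prod = ascending(y, obs) * descending(y, obs)
        largest = max(largest, prod)
        if prod == 1:
            equal_one.append(y)
    return largest, equal_one


largest, equal_one = search(o=2, n=3)
print(largest, equal_one)
print(round(float(ascending((0, 2, 0), (2, 2, 2))), 4))  # 0.2286, as quoted in the text
```

For $o = 2$ and $N = 3$ the product never exceeds one and reaches one only for the constant vectors, which agrees with the conjecture for this setting.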

Conjecture 2: $\Lambda_N^{\uparrow}(\mathbf{y}, o\mathbf{1}) = 1$ (or $\Lambda_N^{\downarrow}(\mathbf{y}, o\mathbf{1}) = 1$) if and
only if the $y_i$'s are constant.

This conjecture states that the only way for the ascending (or
descending) hypothesis to be indistinguishable from the null
hypothesis is for all the observations $y_i$ to be the same. Furthermore,
if the ascending hypothesis cannot be distinguished
from the null hypothesis, then the same is true for the descending
hypothesis.

Conjecture 3: For a given $N$, the bounds in Property 1 are
tight in the sense that one can identify a value of $o$ and corresponding
$\mathbf{y}$ whose monotonic likelihood ratio is arbitrarily close
to either the lower bound of zero or the upper bound of $N!$.

Inspection of the exhaustive list of monotonic likelihood ratios
of possible $\mathbf{y}$'s for small values of $N$ and $o$ has revealed that

$$
\bar{y}_i = \cdots, \qquad
\tilde{y}_i = \begin{cases} o, & i < N/2 \\ 0, & i \ge N/2 \end{cases}
\tag{21}
$$

achieve close to the maximum and minimum values of $\Lambda_N^{\uparrow}$,
respectively, for a given value of $N$ and $o$. A different rounding
function in (21) may lead to a higher $\Lambda_N^{\uparrow}$. Intuitively, as the
values of the $y_i$'s spread apart, the discriminability between the
hypotheses improves. If the observations use the entire dynamic
range of $o$ and they increase linearly with respect to the rank
order, then it makes sense that $\Lambda_N^{\uparrow}$ is as large as possible. Since
maximizing $\Lambda_N^{\uparrow}$ also maximizes $\Lambda_N$ due to (11) and the symmetry
property, $\bar{\mathbf{y}}$ also achieves close to the maximum of $\Lambda_N$.
For a small $\Lambda_N^{\uparrow}$, the $y_i$'s should be decreasing and $\tilde{\mathbf{y}}$ has the
maximum drop possible. While $\tilde{\mathbf{y}}$ leads to a small $\Lambda_N^{\uparrow}$, its corresponding
$\Lambda_N$ value is greater than one because it is monotonically
descending [see (11)]. The observation sequence

$$
\hat{y}_i = o\,\frac{1 - (-1)^{i}}{2} \tag{22}
$$

achieves close to the minimum value of $\Lambda_N$ for a given value of
$N$ and $o$. It is neither increasing nor decreasing and utilizes the
dynamic range of $o$. Table I demonstrates how these sequences
converge to the lower and upper bounds for $\Lambda_N^{\uparrow}$ and $\Lambda_N$
as $o$ increases for a given $N$. The symmetry properties can be
used to show similar results for $\Lambda_N^{\downarrow}$.
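
The convergence toward the lower bound can be reproduced for $N = 3$ with a short sketch. It assumes the recursion in (19) with $h(p; y, o) = p^{y}(1-p)^{o-y}$ and $\Lambda_1^{\uparrow} = 1$, reads (21) as a step from $o$ down to zero, and reads (22) as an alternation between $o$ and zero; the helper names are hypothetical. For odd $N$ the alternating sequence is a palindrome, so its ascending and descending ratios coincide and only the ascending one is computed.

```python
from fractions import Fraction
from functools import lru_cache
from math import comb, factorial


def beta(a, b):
    """Beta function for positive integer arguments."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))


@lru_cache(maxsize=None)
def ascending(y, o):
    """Ascending likelihood ratio via the recursion in (19); y and o are tuples."""
    n = len(y)
    if n == 1:
        return Fraction(1)
    total = Fraction(0)
    for j in range(y[0] + 1, o[0] + 2):
        weight = (comb(o[0] + 1, j)
                  * beta(j + y[1] + 1, o[0] + o[1] + 2 - y[1] - j)
                  / beta(y[1] + 1, o[1] - y[1] + 1))
        total += weight * ascending((j + y[1],) + y[2:], (o[0] + o[1] + 1,) + o[2:])
    return n * total


def step_down(n, o):
    """Sequence of (21): o for i < N/2 and 0 otherwise, with i = 1..N."""
    return tuple(o if i < n / 2 else 0 for i in range(1, n + 1))


def alternating(n, o):
    """Sequence of (22): o(1 - (-1)^i)/2, i.e. o, 0, o, 0, ..."""
    return tuple(o * (1 - (-1) ** i) // 2 for i in range(1, n + 1))


n = 3
for o in (5, 10, 20):
    obs = (o,) * n
    low = float(ascending(step_down(n, o), obs))
    alt = float(ascending(alternating(n, o), obs))
    print(f"o={o:2d}  step-down: {low:.2e}  alternating: {alt:.2e}")
```

The printed values can be compared with the $N = 3$ rows of Table I; both sequences move toward zero as $o$ grows.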

In summary, the evidence to accept the $H_1$ hypothesis
(DPMLR > 1) or the null hypothesis (DPMLR < 1) increases
as the number of observers increases because the spread
of possible DPMLRs increases. Furthermore, if $\mathbf{y}$ happens
to exhibit a perfect monotonic ordering, then the evidence to
support $H_1$ also increases as the spread of the $y_i$'s increases. In
other words, the chance that measurement errors lead to
inferring the wrong hypothesis decreases as the number
of observers increases. The performance of the DPMLRT in
terms of hypothesis errors is evaluated by Monte Carlo simulations
in the next section.

Fig. 1. ROC curves for DPMLR, MC, Pearson correlation, and logistic correlation tests. (a) $\alpha = 1$. (b) $\alpha = 4$. (c) $\alpha = 6$.

TABLE I

Values of $\bar{\mathbf{y}}$, $\tilde{\mathbf{y}}$, and $\hat{\mathbf{y}}$ Show that $\Lambda_N^{\uparrow}$ and $\Lambda_N$ Can Approach Their
Bounds of Zero and $N!$ as the Number of Observers $o$ Increases

Ratio                                                N    o = 5       o = 10      o = 20      o = 40      o = 80
$\Lambda_N^{\uparrow}(\bar{\mathbf{y}}, o\mathbf{1})$      3    5.27        5.93        6.00        6.00        6.00
                                                     5    33.88       71.59       104.38      117.30      119.87
$\Lambda_N^{\uparrow}(\tilde{\mathbf{y}}, o\mathbf{1})$    3    1.62e-004   1.55e-008   1.09e-016   3.93e-033   3.71e-066
                                                     5    1.16e-007   7.69e-015   2.58e-029   2.12e-058   1.04e-116
$\Lambda_N(\hat{\mathbf{y}}, o\mathbf{1})$                 3    6.17e-003   8.47e-006   1.11e-011   1.41e-023   1.64e-047
                                                     5    4.39e-005   9.14e-011   1.71e-022   2.91e-046   4.08e-094

III.  DPMLRT  Performance  Analysis 

In this section, we evaluate the performance of the proposed
DPMLRT. To this end, we generate Monte Carlo realizations of
$\mathbf{y}$, $\mathbf{x}$, and $\mathbf{p}$. Specifically, the $p_i$'s are generated uniformly over
[0, 1). For the monotonic hypothesis, $x_i = (p_i)^{\alpha}$. For the null
hypothesis, the $x_i$'s are i.i.d. from a uniform distribution. For
either hypothesis, the $y_i$'s are random realizations of the binomial
distribution (see (5)). For a given hypothesis and values of
$o\mathbf{1}$, $N$, and $\alpha$, we generated $10^6$ realizations of $\mathbf{y}$, $\mathbf{x}$, and $\mathbf{p}$,
and we computed the associated DPMLR given one scene, i.e.,
$S = 1$. Then, we use the histograms of the DPMLR to generate
ROC curves by varying the acceptance threshold and tabulating
the number of acceptances under the monotonic hypothesis, i.e.,
probability of detection ($P_d$), and under the null hypothesis, i.e.,
probability of false alarm ($P_f$). As a means of comparison, we
also compute ROC curves associated with some other correlation
tests in a similar fashion over the same simulations.

Fig. 2. ROC curves for DPMLR, MC, Spearman correlation, Kendall correlation,
and logistic correlation tests. (a) $o = 10$. (b) $o = 30$.

Fig. 3. ROC curves for DPMLRT for various values of $N$ and $o$. (a) $N = 10$,
and $o = 5$, 10, 20, or 30. (b) $o = 5$ and $N = 5$, 10, 15, or 20.
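
The simulation procedure can be sketched at a much smaller scale. The code below is an illustration under several assumptions, not the authors' simulation: it uses only the ascending ratio as the test statistic (the composite DPMLR is not reproduced), it evaluates the recursion in (19) in floating point with $h(p; y, o) = p^{y}(1-p)^{o-y}$, it uses 300 realizations per hypothesis with $N = 5$ and $o = 5$ instead of $10^6$ realizations, and all function names are hypothetical.

```python
import math
import random


def log_beta(a, b):
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def dpmlr_ascending(y, o):
    """Ascending ratio from the recursion in (19), unrolled as a forward pass."""
    states = {y[0]: 1.0}   # merged first count -> accumulated weight
    o_cur = o[0]           # merged first number of observers
    for i in range(1, len(y)):
        nxt = {}
        for y_cur, w in states.items():
            for j in range(y_cur + 1, o_cur + 2):
                log_term = (math.log(math.comb(o_cur + 1, j))
                            + log_beta(j + y[i] + 1, o_cur + o[i] + 2 - y[i] - j)
                            - log_beta(y[i] + 1, o[i] - y[i] + 1))
                key = j + y[i]
                nxt[key] = nxt.get(key, 0.0) + w * math.exp(log_term)
        states, o_cur = nxt, o_cur + o[i] + 1
    return math.factorial(len(y)) * sum(states.values())


def simulate(monotonic, n, o, alpha, rng):
    """One realization: draw p, x, y and score y in the order given by x."""
    p = [rng.random() for _ in range(n)]
    x = [pi ** alpha for pi in p] if monotonic else [rng.random() for _ in range(n)]
    y = [sum(rng.random() < pi for _ in range(o)) for pi in p]  # binomial draws
    order = sorted(range(n), key=lambda i: x[i])                 # rank by FIQM
    return dpmlr_ascending([y[i] for i in order], [o] * n)


def roc(h1_scores, h0_scores):
    """(Pf, Pd) pairs obtained by sweeping the acceptance threshold downward."""
    points = [(0.0, 0.0)]
    for t in sorted(set(h1_scores) | set(h0_scores), reverse=True):
        pd = sum(s >= t for s in h1_scores) / len(h1_scores)
        pf = sum(s >= t for s in h0_scores) / len(h0_scores)
        points.append((pf, pd))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


rng = random.Random(0)
h1 = [simulate(True, 5, 5, 6, rng) for _ in range(300)]
h0 = [simulate(False, 5, 5, 6, rng) for _ in range(300)]
curve = roc(h1, h0)
print(round(auc(curve), 3))
```

Replacing `dpmlr_ascending` in `simulate` with another statistic yields the ROC curve of that test over the same kind of realizations.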

Fig. 1 includes ROC curves of the DPMLR, the MC [17],
the Pearson correlation, and the logistic correlation [23], [27]
tests for various values of $\alpha$ when $N = 10$ and $o = 5$. Interested
readers are referred to [17] for a detailed description of the
monotonic and logistic correlations. In Fig. 1(a), where $\alpha = 1$,
the Pearson correlation performs better than the others. This is
explained by the fact that the relationship between $\mathbf{x}$ and $\mathbf{p}$ is
actually linear, and Pearson correlation exploits the actual values
of $\mathbf{x}$ and not just the ordering. In essence, the test for linearity
is better in this case than the more general test of monotonicity
because it exploits more information. As the $g(x)$ function becomes
more nonlinear (i.e., $\alpha$ increases), the performance of the
Pearson correlation degrades significantly. Clearly, the logistic
correlation is more robust to the nonlinearity than the Pearson
correlation, but since not all monotonic relations follow a logistic
function, the logistic correlation performs worse than the
MC. Note that $\alpha$ does not change the ordering of the $x_i$'s. Therefore,
the performance of the DPMLRT and the MC is invariant
to the nonlinearity. The DPMLRT always outperforms the MC
because for the case of the uniform prior on $\mathbf{p}$, the DPMLRT is
the most powerful test of monotonicity.

We also consider some common rank-order correlations, namely the
Spearman correlation and the Kendall correlation, when $N = 10$
and $\alpha = 6$. The ROC curves for different tests are shown
in Fig. 2. Fig. 2(a) corresponds to a case where $o = 10$ and
Fig. 2(b) corresponds to a case where $o = 30$. The DPMLRT
always outperforms the others as expected given that the $p_i$'s
are generated by the assumed prior distribution. As the number
of observers increases, the gap between the ROC curves of the
DPMLRT and the rank-order correlation tests becomes larger.
When $o = 10$, the performance of the MC is a little poorer
than that of the rank-order correlations. For larger $o$ ($o = 30$),
the MC outperforms the rank-order correlations because it takes
advantage of the values of the $y_i$'s while the rank-order correlations
only use their rank information. The logistic correlation exhibits
the worst performance because of the limitation of the logistic
regression fitting.
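
The difference between value-based and rank-based correlations can be seen on a toy example. The sketch below uses plain-Python implementations written for this illustration (Kendall's tau-a and average ranks for ties are our choices, and the detection counts are hypothetical). It shows that raising $x$ to the power $\alpha$ changes the Pearson correlation but leaves the Spearman and Kendall correlations untouched, because the ordering of the $x_i$'s is preserved.

```python
import math


def ranks(values):
    """Average ranks: tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(a, b):
    """Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


def kendall(a, b):
    """Kendall tau-a: (concordant - discordant) / number of pairs."""
    n, score = len(a), 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (a[i] - a[j]) * (b[i] - b[j])
            score += (sign > 0) - (sign < 0)
    return score / (n * (n - 1) / 2)


p = [0.15, 0.30, 0.45, 0.60, 0.75, 0.90]   # hypothetical detection probabilities
y = [1, 1, 3, 2, 4, 5]                     # hypothetical counts out of o = 5
for alpha in (1, 6):
    x = [pi ** alpha for pi in p]
    print(alpha, round(pearson(x, y), 3), round(spearman(x, y), 3),
          round(kendall(x, y), 3))
```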

Fig. 3 provides the ROC curves of the DPMLRT for different
$o$'s and $N$'s. The circle on each curve denotes the operating
point when the threshold is set to one. As shown in [31], the
slope of the ROC curve for an LRT is equal to the corresponding
threshold value. Thus, when the threshold is one, the slope is
one, corresponding to the "knee" of the ROC curve as demonstrated
in Fig. 3, which uses a linear scale for the $P_f$-axis. As
one increases the number of observers, the knee of the ROC
curve shifts to the top left corner, which means higher $P_d$ and
lower $P_f$ can be achieved for a threshold of one. As expected,
as the number of fused images $N$ or the number of observers $o$
increases, the efficacy of the DPMLR improves.
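
The slope property quoted from [31] follows from a short standard argument, sketched here under the simplifying assumption that the likelihood ratio $\Lambda$ has continuous densities $f_0$ and $f_1$ under the null and monotonic hypotheses (the symbols $f_0$, $f_1$, and $\eta$ are introduced only for this sketch). With discrete observations such as the $y_i$'s the ROC curve is piecewise linear, so the relation holds only approximately.

```latex
\begin{align}
P_d(\eta) &= \int_{\eta}^{\infty} f_1(\lambda)\,d\lambda, &
P_f(\eta) &= \int_{\eta}^{\infty} f_0(\lambda)\,d\lambda, \\
f_1(\lambda) &= \lambda\, f_0(\lambda)
  && \text{(the statistic is itself a likelihood ratio)}, \\
\frac{dP_d}{dP_f} &= \frac{dP_d/d\eta}{dP_f/d\eta}
  = \frac{-f_1(\eta)}{-f_0(\eta)} = \eta .
\end{align}
```

Setting $\eta = 1$ gives a slope of one at the operating point marked by the circle.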

The next set of simulations considers how the DPMLR performs
when the model assumptions do not match the data. For
these simulations, $N = 10$, $o = 10$, and $\alpha = 6$. The first case
considers uniform random variables $p_i$'s with a prespecified correlation
matrix $\Sigma$, whose $(m, n)$th element denotes the correlation
coefficient of $p_m$ and $p_n$ ($1 \le m, n \le N$). The method
for generating such $p_i$'s is from [32]. In this case we denote the
nondiagonal elements of $\Sigma$ by $\rho$ (the diagonal elements equal 1).
The $p_i$'s are completely correlated or independent for $\rho = 1$ or
$\rho = 0$, respectively. Fig. 4 compares the ROC curves of the DPMLR
with the other correlations for different $\rho$'s. Fig. 4(a)-(c) correspond
to $\rho = 0.1$, 0.5, and 0.9, respectively. By comparing
these ROC curves to Fig. 2, we can see that the gap between the
DPMLRT and the others decreases as $\rho$ increases. But clearly
the DPMLRT exhibits the best performance among these correlations.
In the limit, as $\rho$ goes to 1, the monotonic evaluation is
moot as all values of the $p_i$'s are equal.

Fig. 5. ROC curves under generalized binomial distribution. (a) $\tau = 0.5$.
(b) $\tau = 1$.

The next case considers the effect when the model of human
performance does not match the binomial distribution. We consider
the generalized binomial distribution [33] to incorporate
diversity in the capabilities of humans. Specifically, the nominal
human performance $p_i$ and associated FIQM $x_i = (p_i)^{\alpha}$
are generated as usual. Then, the realized mean performance
for the observers $\tilde{p}_i$ is drawn from the uniform distribution over
$[p_i - \tau p_i(1 - p_i),\ p_i + \tau p_i(1 - p_i)]$ and $y_i$ is drawn from a binomial
distribution with parameters $o$ and $\tilde{p}_i$.3 Here $\tau \in [0, 1]$ is
referred to as the spread parameter, which denotes the deviation
of the $y_i$'s distribution from the binomial distribution. Note that for
$\tau = 0$, $y_i$ still follows the binomial distribution with parameters
$o$ and $p_i$. Fig. 5 shows the ROC curves of the DPMLRT, the
monotonic, the rank-order, and the logistic correlation tests for
different spread parameters $\tau$. This figure demonstrates that the
DPMLRT is robust to $\tau$ and still outperforms the others even
when $\tau$ is as large as one.
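
The sampling scheme just described can be written in a few lines. This is a minimal sketch with hypothetical names, following the text literally: one realized mean per fused image, shared by all observers, followed by a binomial draw.

```python
import random


def sample_detections(p, o, tau, rng):
    """Draw y for one fused image under the spread model described above.

    p   : nominal detection probability in [0, 1]
    o   : number of observers
    tau : spread parameter in [0, 1]; tau = 0 gives the ordinary binomial
    """
    half_width = tau * p * (1.0 - p)
    # For tau <= 1 the interval stays inside [p^2, 1 - (1 - p)^2], a subset of [0, 1].
    p_realized = rng.uniform(p - half_width, p + half_width)
    return sum(rng.random() < p_realized for _ in range(o))


rng = random.Random(1)
alpha, o, tau = 6, 10, 0.5
p_nominal = [rng.random() for _ in range(10)]
x = [p ** alpha for p in p_nominal]                 # FIQM values
y = [sample_detections(p, o, tau, rng) for p in p_nominal]
print(y)
```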

3 As discussed in [33], any pmf of $y_i$ over $[0, o]$ can be generated by choosing
a specific pdf to generate $\tilde{p}_i$.

The final case demonstrates that the DPMLR is not the UMP
test for an arbitrary prior distribution. Consider a pathological case
in which for odd $i$, $p_i = p_{i+1}$ and $p_i$ is drawn from a uniform
distribution over [0, 1). In practice, this case is unlikely
because it means that two different fusion methods provide images
with equivalent performance over multiple scenes. Nevertheless,
Fig. 6 compares the DPMLRT with the other correlations.
The figure shows that the DPMLRT outperforms the
others when $P_f$ is high. But the other correlations all achieve
higher detection probability than the DPMLRT when $P_f$ is less
than 0.01.

Fig. 6. Example of the case in which the $p_i$'s are at the edge of under $H_1$.

IV.  FIQM  Evaluation  via  the  DPMLRT 

This section demonstrates the application of the DPMLRT to
score potential FIQMs. The DPMLRT and some other correlations
are used to evaluate the monotonic relationships between
17 different FIQMs proposed in the literature and the human detection
results in a specific target detection experiment. Details
of the experiment and the discussion about the evaluation results
are provided in the following subsections.

A.  Experimental  Setup 

Long-wave infrared (LWIR) and image intensified (II) imagery
were collected in a simulated military operation in an
urban terrain (MOUT) environment. The imagery includes
six interior and exterior locations, where four scenarios were
collected for each location. The four scenarios represent cases
where zero, one, two, and three people are within the field
of regard of the camera. Individuals who were in the field of
regard were typically obscured by objects in the scene, such
as doorways, windows, furniture, and tables. For each of the
scenarios, a horizontal pan of 150 images was then used to
create a larger mosaic of imagery in both the LWIR and II
bands.

The perceptual goal for the human observers is to detect the
target in the scenes by interrogating the fused imagery. To generate
the imagery, the LWIR and II images were registered,
bore-sighted, and fused via six different algorithms: 1) contrast
pyramid A (CONA), 2) contrast pyramid B (CONB) [34], 3) discrete
wavelet transform (DWTT) [1], [35], [36], 4) color discrete
wavelet transform (CDWT), 5) color averaging (CLAV), and
6) color multiscale transform (CLMT) [37]. The first three algorithms
generate grayscale fused images, and the final three
methods generate color fused images. It is worth mentioning
that the distinction between CONA and CONB is which image
(LWIR or II) populates the coarsest coefficients in the pyramid.
Also, the color methods use a grayscale fusion method for
the luminance component, map the differences in the image coefficients
into the saturation component, and encode the source
of the largest coefficient (LWIR versus II) in the hue component.
The CDWT uses this coloring scheme for the DWT coefficients,
the CLAV uses simple averaging for the luminance and
the raw pixels for the color components, and the CLMT uses
the coloring scheme for the multiscale fusion method defined
in [37]. Finally, it is instructive to compare the fused imagery
against the source imagery. Therefore, we consider eight fused
image displays: 1) II, 2) LWIR, 3) CONA, 4) CONB, 5) DWTT,
6) CDWT, 7) CLAV, and 8) CLMT.

Fig. 7 shows an example of the resulting eight fused image
displays for a typical scenario in our experiment.4 In this scenario,
there are two target persons, who are highlighted by the
boxes in each image. As seen in Fig. 7(b), the human targets
stand out in the LWIR imagery because they are usually hotter
than the background. For the most part, detection performance
is best on the LWIR-only band because the search task can often
be reduced to simply finding the white hot object on a grey background.
However, the II band has the potential to add context to
the LWIR band, as objects like tables and chairs are easier
to distinguish in the II band [see Fig. 7(a) and (b)]. Therefore,
there can be value in fusing the two bands.

A perception test was set up whereby observers were asked to
try to find the human targets in a “field of regard” search. An observer's
display was calibrated to look as though it were seeing
a single field of regard of a given scene, and the observer had to
navigate across the scene and detect human targets. Observers
could mark as many as three places on the display as detections
for human targets (as they were told that the images could contain
between zero and three humans hiding in the scene). At any
point an observer could push a button to indicate that they either
did not detect any targets in the scene or that there were no other
targets in the scene. In the end, the detection performance of the
humans was recorded over the eight image displays.

Overall, $o = 8$ observers evaluated 18 scenarios that contained
35 human targets. We treat each target and its surrounding
area as a scene for every scenario. For example, the inside of
each box in Fig. 7 represents a scene, as shown in Fig. 8(a)-(h).
Then, $y_s$ is the number of observers that correctly detected the
target located in the $s$th scene for $s = 1, \ldots, 35$.

B. Evaluated FIQMs

We test 17 potential FIQMs over each scene. These FIQMs
are listed in Table II with corresponding citations. Most measures
listed in Table II were also evaluated in [17] for a recognition
task. All the measures except the first are computed automatically.
The first ten measures are simply complexity features
that do not consider the source images (the no-source comparative
class according to the classification in Section I). They represent
the structure, texture, contrast, and/or edge intensities in
the image in order to characterize the complexity of the image.
Such measures have already been used to evaluate the quality
of image fusion algorithms [17], [18], [38]. Most of these measures
have been inspired by work to develop clutter complexity
measures [19], [39]. These works search for features that characterize
the degree to which the background appears target-like
[39]. Ideally, the clutter complexity determines how hard it is to
detect or classify a target in the scene due to the complexity of
the background. The last seven measures compare how well the
salient features in the two source images are transferred into
the fused image (the source comparative class). For the most
part, the distinction between these comparative measures is in
the definition of saliency.

4 The color versions of the CDWT, CLAV, and CLMT displays in Figs. 7-8
are available in the online version of this paper.

Fig. 7. Eight fused image displays for one of the 18 scenarios: (a) II, (b) LWIR, (c) CONA, (d) CONB, (e) DWTT, (f) CDWT, (g) CLAV, and (h) CLMT.

Fig. 8. Example of the eight fused image displays and corresponding silhouette for a single scene, i.e., a target instance, in Fig. 7: (a) II, (b) LWIR, (c) CONA,
(d) CONB, (e) DWTT, (f) CDWT, (g) CLAV, (h) CLMT, and (i) silhouette.

TABLE II

List of the Evaluated FIQMs

Category                                     Index   Measure Description
Contrast                                     1       Difference of intensity or color between the target and the background
Saturation [17]                              2       Normalized histogram peak
STD                                          3       Standard deviation
Schmieder Weathersby [19]                    4       Block average local standard deviation
fBm [20]                                     5       Hurst parameter for fBm model
TIR [21]                                     6       Block average target interference ratio (contrast)
Energy [21]                                  7       Block average energy of histogram
Entropy [21]                                 8       Block average entropy of histogram
Homogeneity [21]                             9       Block average pixel variation
Block Outlier [21]                           10      Block average number of outliers
Universal Quality Index [23]                 11      Average Structure SIMilarity (SSIM) index between fused and reference images
Information Measures [11]                    12      Average mutual information between fused and reference images (bin size = 16)
Objective Measure [10]                       13      Average objective edge information between fused and reference images
Salient Quality Index [12]                   14      Weighted average salient quality index of edge intensities between fused and reference images
                                             15      Weighted average salient quality index between fused and reference images
                                             16      Average salient quality index between fused and reference images
Harris Response based quality metric [40]    17      Difference of Harris response between fused and reference images

Ideally, the FIQM should be computed automatically from the
fused and source images. The contrast measure is considered because
it is one of the measures that is averaged in an objective
National Imagery Interpretability Ratings Scale (NIIRS) rating
[41]. Furthermore, it is intuitive that the contrast between the
target and the background facilitates ease of detection. The contrast
is computed by manually segmenting human silhouettes
for each scene. Fig. 8(i) shows an example of the silhouette that
separates the target from the background. The white part in the
silhouette denotes target pixels, and the black part denotes background
pixels. The measure is equivalent to the percent contrast
used in [42]. For grayscale imagery, it is defined as

$$
\text{contrast} = \frac{|I_t - I_b|}{d} \tag{23}
$$

where $I_t$ and $I_b$ are the mean target and background intensities,
respectively, and $d$ denotes the dynamic range, i.e., the intensity
difference between the brightest and darkest pixels in a
scene. For color imagery, the RGB coordinates are converted to
the CIE L*a*b* color space [43] and the single band contrast
is calculated independently over the L*, a*, and b* bands via
(23). Then the root sum square of the three single band contrasts
is reported as the overall contrast. Since the information
about the color is given in the a* and b* bands, these bands exhibit
zero contrast for grayscale imagery, and the color version
of contrast is a consistent generalization of the grayscale definition,
i.e., it provides the same answer if the RGB image contains
no color. Intuitively, the color version of contrast integrates the
contrast that exists in all ways the eye can distinguish the foreground
from the background, i.e., lightness and color. It might
be possible to generate an automated contrast measure by incorporating
automated image segmentation techniques. This is
a matter of future investigation.
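
The contrast computation can be sketched as follows. This is an illustrative reading of (23), not the authors' code: images are nested lists, the silhouette is a binary mask with both target and background pixels present, the L*, a*, and b* bands are assumed to be already available (the RGB to CIE L*a*b* conversion is not shown), the contrast is taken as an absolute difference, and a band with zero dynamic range is assigned zero contrast. The last two points and all names are assumptions.

```python
import math


def band_contrast(band, mask):
    """Single-band contrast of (23): |mean target - mean background| / dynamic range."""
    target = [v for row, mrow in zip(band, mask) for v, m in zip(row, mrow) if m]
    background = [v for row, mrow in zip(band, mask) for v, m in zip(row, mrow) if not m]
    pixels = target + background
    dynamic_range = max(pixels) - min(pixels)
    if dynamic_range == 0:      # assumption: a flat band contributes no contrast
        return 0.0
    mean_t = sum(target) / len(target)
    mean_b = sum(background) / len(background)
    return abs(mean_t - mean_b) / dynamic_range


def color_contrast(l_band, a_band, b_band, mask):
    """Root sum square of the three single-band contrasts."""
    return math.sqrt(sum(band_contrast(band, mask) ** 2
                         for band in (l_band, a_band, b_band)))


lightness = [[0, 50], [100, 200]]     # toy 2x2 scene
silhouette = [[0, 0], [0, 1]]         # 1 marks target pixels
flat = [[0, 0], [0, 0]]               # a* and b* of an image without color

print(band_contrast(lightness, silhouette))                 # 0.75
print(color_contrast(lightness, flat, flat, silhouette))    # 0.75
```

With flat a* and b* bands the color contrast reduces to the grayscale value, which is the consistency property noted in the text.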

While  the  generalization  of  contrast  for  color  imagery  is 
straightforward,  it  is  not  clear  how  to  best  extend  the  definition 
of  the  other  automatic  FIQMs  to  accommodate  color  imagery. 
To  this  end,  we  follow  the  convention  in  [39]  where  for  the 
color  images,  one  generates  four  color  measures  for  a  given 
grayscale  measure.  Namely,  the  grayscale  measure  is  computed 
over  each  RGB  band  and  summarized  by  the  1)  maximum, 
2)  minimum,  and  3)  median  values  over  all  bands.  The  fourth 
measure  is  computed  by  converting  the  RGB  image  into  a 
grayscale  image  before  calculating  the  measure. 
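
The four color summaries can be sketched generically for any grayscale measure. In the example below, the standard deviation (measure 3 in Table II) serves as the grayscale measure, and the RGB to grayscale conversion is taken as an equal-weight mean of the three bands; the conversion weights and all names are assumptions made for illustration, since the section does not specify them.

```python
import statistics


def std_measure(band):
    """Grayscale measure used for illustration: population standard deviation."""
    return statistics.pstdev([v for row in band for v in row])


def color_measures(rgb, measure):
    """Return the four color summaries of a grayscale measure.

    rgb is a tuple (R, G, B) of equally sized nested lists.
    """
    r, g, b = rgb
    per_band = [measure(r), measure(g), measure(b)]
    # Assumed conversion: equal-weight average of the three bands.
    gray = [[(rv + gv + bv) / 3 for rv, gv, bv in zip(rr, gr, br)]
            for rr, gr, br in zip(r, g, b)]
    return {
        "max": max(per_band),
        "min": min(per_band),
        "median": statistics.median(per_band),
        "gray": measure(gray),
    }


rgb = ([[0, 2]], [[0, 4]], [[0, 6]])   # toy 1x2 image
result = color_measures(rgb, std_measure)
print(result)
```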

C.  Evaluation  Results  and  Discussion 

First, we evaluated the consistency of the FIQMs with human
detection performance over the five grayscale fused image displays:
1) II, 2) LWIR, 3) CONA, 4) CONB, and 5) DWTT. Then,
we considered scoring the FIQMs generalized for color using all
eight fused image displays.
TABLE III

List of DPMLR Scores, Associated p-Values, and Average Values of Some Other
Correlations for 17 Grayscale FIQMs Tested Over Five Image Displays

Index  $\Lambda_N$  p-value  Mean MC  Mean |MC|  Mean LC  Mean |LC|  Mean SC  Mean |SC|  Mean KC  Mean |KC|
1      1.3291       0.1221   0.5835   0.8276     0.5628   0.8069     0.5210   0.6168     0.4700   0.5602
2      0.0204       0.4284   0.0153   0.6962     0.0105   0.6382     0.0053   0.4158     0.0190   0.3425
3      0.0301       0.3982   0.1291   0.7958     0.1374   0.7714     0.0868   0.4981     0.0891   0.4285
4      0.0340       0.3884   0.0192   0.7965     0.0180   0.7911     0.0720   0.4498     0.0711   0.3807
5      0.5925       0.1741   0.5322   0.8564     0.5330   0.8398     0.3608   0.5730     0.3094   0.4933
6      0.3637       0.2073   0.1380   0.5493     0.1381   0.5490     0.1381   0.5490     0.0649   0.4158
7      0.0376       0.3803   0.0546   0.7670     0.0405   0.7248     0.0748   0.4433     0.0729   0.3780
8      0.0392       0.3768   0.0921   0.7636     0.0883   0.7386     0.0896   0.4434     0.0745   0.3730
9      0.0382       0.3789   -0.0106  0.8068     -0.0012  0.7851     -0.0360  0.5160     -0.0292  0.4384
10     0.0422       0.3709   0.1453   0.7723     0.1385   0.7555     0.0927   0.5002     0.0807   0.4197
11     0.0316       0.3943   -0.0292  0.7543     -0.0268  0.7339     0.0130   0.4066     0.0345   0.3494
12     0.0362       0.3833   0.2030   0.7262     0.1667   0.6771     0.1363   0.3950     0.1315   0.3457
13     0.0479       0.3607   0.1018   0.7863     0.1063   0.7622     0.0647   0.4677     0.0651   0.4094
14     0.0252       0.4120   0.1454   0.7668     0.1426   0.7581     0.0792   0.4340     0.0870   0.3743
15     0.0242       0.4150   0.1541   0.7756     0.1402   0.7558     0.0873   0.4421     0.0934   0.3807
16     0.0387       0.3778   0.3124   0.7694     0.2969   0.7391     0.1534   0.4396     0.1430   0.3898
17     0.3407       0.2122   0.1973   0.5980     0.1931   0.5910     0.1424   0.3792     0.1427   0.3527

Table III provides the composite DPMLR score over the five
grayscale displays of the 35 scenes for each of the 17 grayscale
measures as well as the corresponding p-values. Note that for
each FIQM, the p-value is evaluated by calculating the probability
of obtaining a result with the DPMLR larger than the
composite DPMLR score listed in Table III when the $H_0$ hypothesis
is true. Furthermore, the table also includes the average
values and average absolute values of the monotonic, logistic,
Spearman, and Kendall correlations.
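
One way to approximate such a p-value is by simulation under the null hypothesis. The sketch below is a generic illustration and makes no claim about how the p-values in Table III were computed: it shuffles the observations to mimic a random relationship with the FIQM ordering, and it uses a simple concordance count as a stand-in for the DPMLR. The statistic, the shuffling scheme, and all names are assumptions.

```python
import random


def concordance(y):
    """Stand-in statistic: concordant minus discordant pairs of y in FIQM order."""
    score = 0
    for i in range(len(y)):
        for j in range(i + 1, len(y)):
            score += (y[j] > y[i]) - (y[j] < y[i])
    return score


def null_p_value(statistic, y, trials, rng):
    """Fraction of shuffled orderings whose statistic exceeds the observed one."""
    observed = statistic(y)
    exceed = 0
    for _ in range(trials):
        shuffled = list(y)
        rng.shuffle(shuffled)          # a random ordering plays the role of H0
        exceed += statistic(shuffled) > observed
    return exceed / trials


rng = random.Random(0)
y_sorted_by_fiqm = [0, 1, 2, 3, 4, 5, 6, 8]    # hypothetical detection counts
print(null_p_value(concordance, y_sorted_by_fiqm, 2000, rng))
```

A perfectly ascending sequence cannot be exceeded by any reordering, so its estimated p-value is zero; less consistent orderings yield larger values.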

The second column of Table III shows that the composite
DPMLR scores for all but the grayscale contrast measure are
significantly less than one. This means that the evidence points
to the fact that these potential FIQMs are viewed as noise with
respect to ordering the detection probabilities of the imagery.
The poor performance of the source comparative measures may
be explained by structure in the fused and source images that
leads to good interimage correlation but that has no (or even
negative) effect on human performance. Examples of the pitfalls
of source comparative measures when the ideal image is
unknown are provided in [14].

For the grayscale contrast measure, the composite DPMLR score is still modest at 1.3291, and the p-value is not very low. By comparison, a perfect FIQM that consistently ordered the number of detections $\mathbf{y}$ over all 35 scenes would provide a composite DPMLR of 9.632. Thus, while there is evidence to reject the null hypothesis, the evidence supporting the monotonic hypothesis is not compelling. However, the composite DPMLR score for the grayscale contrast measure is much greater than the scores for the other measures. Thus, contrast may be a key aspect of a proper FIQM.

From Table III, one can see that the orderings of the FIQMs via the DPMLR and the other correlations differ. Also note that, for each FIQM, the differences between the average correlations and the average absolute correlations indicate that the direction of the monotonic relationship is not consistent over the 35 scenes. The contrast measure exhibits by far the largest DPMLR. However, its average absolute values of the MC and the logistic correlation (mean |MC| and mean |LC| in Table III) are less than those of the fBm. Furthermore, the other average correlations of the contrast measure are only slightly larger than those of the fBm. To better compare these two measures, and to show how differently the DPMLR and the other four correlations evaluate a FIQM based upon the human perception results, we present the human detection results and the scores of the DPMLRT, monotonic, logistic, Spearman, and Kendall correlation tests for each scene for the contrast and the fBm measures.

Fig. 9 graphically depicts the relationship between the aforementioned two measures and the human performance over all 35 scenes. The lines marked by asterisks correspond to the contrast measure, and the lines marked by circles correspond to the fBm. Since only the five gray fused image displays are considered here, for each scene and each FIQM we have five detection numbers $y_i \in [0, 8]$ and five FIQM values ($1 \le i \le 5$). In each plot of Fig. 9, the vertical axis denotes the number of humans that detected the target, while the horizontal axis denotes the rank of the FIQM values sorted in ascending order. The shade of the background of each plot indicates the significance of the monotonic ordering for each scene. The significance value is obtained by calculating the DPMLR for an imaginary FIQM whose values perfectly match the $y_i$'s in monotonically increasing order.

Tables IV and V provide the ascending and descending DPMLRs as well as the other four correlations (monotonic, logistic, Spearman, and Kendall) over each scene for the contrast measure and the fBm measure, respectively. Note that in Scenes 34 and 35, the same number of detections is obtained for all five displays. Because the target is so obvious in Scene 34, all eight observers detected it successfully. Similarly, no one detected the target in Scene 35 because it is so unclear. Both cases are naturally ignored, as they do not provide any information on the monotonicity.

One very important property of the DPMLR is that it can capture the significance of a scene based upon the human detection results and adjust its score accordingly to provide a more precise evaluation. The significance, as defined, is determined by the number of unique human detection values and the spread of these values over the dynamic range from 0 to 8 detections. Essentially, the significance describes how easy (or difficult) it
Fig. 9. Scatter plots of the number of detections versus the quality rank order over 35 scenes ("*" represents the contrast measure, and "o" represents the fBm measure).


TABLE IV

Statistics for the Contrast Measure Where MC, LC, SC, and KC Are the Monotonic, Logistic, Spearman, and Kendall Correlations, Respectively (Λ↑ and Λ↓ Are the Ascending and Descending DPMLRs)

        Scene 1    Scene 2    Scene 3    Scene 4    Scene 5    Scene 6    Scene 7
Λ↑       0.7043    23.6503    45.2298     1.1634     0.6730    31.4151     0.0010
Λ↓       0.0000     0.0000     0.0000     0.0000     0.0001     0.0000     0.0137
MC       0.9379     0.9935     1.0000     0.9189     0.8376     1.0000    -0.5601
LC       0.9379     0.9699     0.9990     0.9189     0.8376     1.0000    -0.5601
SC       0.6000     0.9000     0.9747     0.6000     0.7000     0.9747    -0.2052
KC       0.4000     0.8000     0.9487     0.4000     0.6000     0.9487    -0.3162

        Scene 8    Scene 9   Scene 10   Scene 11   Scene 12   Scene 13   Scene 14
Λ↑      29.1496    24.7705     0.0274    11.1297     0.2919     0.0005    17.3610
Λ↓       0.0000     0.0000     0.0746     0.0014     0.0006     0.0000     0.0009
MC       1.0000     1.0000    -0.6882     0.9830     0.6571     0.6124     1.0000
LC       0.9826     0.8778    -0.6882     0.9558     0.5026     0.6124     1.0000
SC       0.9747     0.9747    -0.0513     0.9000     0.5643     0.3354     0.9747
KC       0.9487     0.9487     0.1054     0.8000     0.5270     0.3586     0.9487

       Scene 15   Scene 16   Scene 17   Scene 18   Scene 19   Scene 20   Scene 21
Λ↑       9.4196    11.8403    10.8893    10.4186     1.0475     4.7336     0.1784
Λ↓       0.0002     0.0022     0.0000     0.0001     0.0376     0.0250     0.7175
MC       0.9886     1.0000     1.0000     1.0000     0.7454     0.9354    -0.5345
LC       0.9708     1.0000     1.0000     1.0000     0.7454     0.8815    -0.5345
SC       0.8208     0.9487     0.8944     0.8944     0.6325     0.7906    -0.2635
KC       0.7379     0.8944     0.8367     0.8367     0.4472     0.6708    -0.2236

       Scene 22   Scene 23   Scene 24   Scene 25   Scene 26   Scene 27   Scene 28
Λ↑       0.3496     1.7957     4.9996     0.0108     4.3533     0.5042     2.0380
Λ↓       0.2204     0.0501     0.0000     0.1740     0.0042     0.0259     0.1714
MC      -0.4082     0.8898     1.0000    -0.6124     1.0000     0.6124     0.7638
LC      -0.4082     0.8458     1.0000    -0.6124     1.0000     0.6124     0.5417
SC       0.1118     0.4472     0.7071    -0.3536     0.7071     0.3536     0.5774
KC       0.1195     0.3586     0.6325    -0.3162     0.6325     0.3162     0.5164

       Scene 29   Scene 30   Scene 31   Scene 32   Scene 33   Scene 34   Scene 35
Λ↑       3.5970     0.4156     0.4156     2.4100     0.7393     1.0000     1.0000
Λ↓       0.0296     1.2533     1.2533     0.1818     0.7393     1.0000     1.0000
MC       1.0000    -0.6124    -0.6124     1.0000     0.4082        NaN        NaN
LC       1.0000    -0.6124    -0.6124     1.0000     0.4082        NaN        NaN
SC       0.7071    -0.3536    -0.3536     0.7071     0.0000        NaN        NaN
KC       0.6325    -0.3162    -0.3162     0.6325     0.0000        NaN        NaN

is for random noise to affect the order of the human detection results. The more unique values that the human detection results take in a scene, the less likely it is that random noise will order the human detection results. The monotonic and logistic correlations give a value of one whenever a scene's scatter plot is perfectly monotonic, as observed from Scenes 3, 6, 8, 9, 14, 16, 17, 18, 24, 26, 29, and 32 in Fig. 9 and the corresponding statistics in Table IV. The Spearman and the Kendall correlations give a value of one whenever a scene's scatter plot is strictly monotonic, as observed from the scatter plot of Scene 5 and the
TABLE V

Statistics for the fBm Where MC, LC, SC, and KC Are the Monotonic, Logistic, Spearman, and Kendall Correlations, Respectively (Λ↑ and Λ↓ Are the Ascending and Descending DPMLRs)

        Scene 1    Scene 2    Scene 3    Scene 4    Scene 5    Scene 6    Scene 7
Λ↑       0.7043     0.7043     0.3819     1.5343    35.3484     6.8845     0.0001
Λ↓       0.0000     0.0000     0.0000     0.0000     0.0000     0.0000     1.0015
MC       0.9379     0.9379     0.9396     0.9189     1.0000     0.9766    -0.8402
LC       0.9379     0.9379     0.9396     0.8385     0.9766     0.9686    -0.7796
SC       0.6000     0.6000     0.6669     0.6000     1.0000     0.8208    -0.6156
KC       0.4000     0.4000     0.5270     0.4000     1.0000     0.7379    -0.5270

        Scene 8    Scene 9   Scene 10   Scene 11   Scene 12   Scene 13   Scene 14
Λ↑       0.8749     0.0708     0.0011     0.6460    19.4999     0.2934     0.0009
Λ↓       0.0001     0.0000     2.5692     0.0064     0.0001     0.0000    17.3610
MC       0.8402     0.7605    -0.9081     0.8137     1.0000     0.9186    -1.0000
LC       0.7956     0.7605    -0.8147     0.8137     0.9939     0.9186    -0.9801
SC       0.5643     0.6669    -0.6669     0.3000     0.9747     0.7826    -0.9747
KC       0.5270     0.5270    -0.5270     0.2000     0.9487     0.5976    -0.9487

       Scene 15   Scene 16   Scene 17   Scene 18   Scene 19   Scene 20   Scene 21
Λ↑       0.7544     0.5725     5.2378    10.4186     9.9553     0.0779     0.1318
Λ↓       0.0007     0.0099     0.0000     0.0001     0.0071     0.5866     0.8082
MC       0.9535     0.6757     0.9949     1.0000     1.0000    -0.6124    -0.7638
LC       0.9535     0.6233     0.9897     1.0000     0.9576    -0.5001    -0.7638
SC       0.0513     0.3162     0.6708     0.8944     0.9487    -0.3162    -0.3689
KC      -0.1054     0.2236     0.5976     0.8367     0.8944    -0.2236    -0.2236

       Scene 22   Scene 23   Scene 24   Scene 25   Scene 26   Scene 27   Scene 28
Λ↑       0.1434     0.2252     4.9996     0.1740     4.3533     4.3533     0.4822
Λ↓       1.0109     0.6329     0.0000     0.0108     0.0042     0.0042     0.4822
MC      -0.6124    -0.6124     1.0000     0.6124     1.0000     1.0000     0.6124
LC      -0.6124    -0.6124     1.0000     0.6124     1.0000     1.0000     0.6124
SC      -0.3354    -0.2236     0.7071     0.3536     0.7071     0.7071     0.0000
KC      -0.3586    -0.1195     0.6325     0.3162     0.6325     0.6325     0.0000

       Scene 29   Scene 30   Scene 31   Scene 32   Scene 33   Scene 34   Scene 35
Λ↑       3.5970     2.4100     0.7393     2.4100     1.2533     1.0000     1.0000
Λ↓       0.0296     0.1818     0.7393     0.1818     0.4156     1.0000     1.0000
MC       1.0000     1.0000     0.4082     1.0000     0.6124        NaN        NaN
LC       1.0000     1.0000     0.4082     1.0000     0.6124        NaN        NaN
SC       0.7071     0.7071     0.0000     0.7071     0.3536        NaN        NaN
KC       0.6325     0.6325     0.0000     0.6325     0.3162        NaN        NaN

corresponding statistics in Table V. However, the DPMLR gives a score much greater than one in the significant scenes. Specifically, in Scenes 3, 6, 8, 9, and 14, noise is much less likely to order the data than in Scenes 24, 26, 29, and 32. As a result, the DPMLR provides significantly higher scores in the former than in the latter, as seen in Table IV.
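The tie behavior of the rank correlations discussed above can be checked with a small sketch. The code below is an illustration only, not the authors' implementation: the function names and the detection counts are hypothetical, Spearman is taken as the Pearson correlation of average ranks, and Kendall is assumed to be the tau-b variant. Under those assumptions, five displays with distinct FIQM values and one tied pair of detection counts give 0.9747 and 0.9487, and a scene with only two distinct detection values gives 0.7071 and 0.6325, which are patterns that recur in Tables IV and V.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(a, b):
    """Pearson correlation; NaN when either input has zero variance."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    if var_a == 0 or var_b == 0:
        return float("nan")
    return cov / math.sqrt(var_a * var_b)


def spearman(q, y):
    """Spearman correlation as the Pearson correlation of average ranks."""
    return pearson(average_ranks(q), average_ranks(y))


def kendall_tau_b(q, y):
    """Kendall tau-b, which corrects the denominator for tied pairs."""
    n = len(q)
    concordant = discordant = ties_q = ties_y = 0
    for i in range(n):
        for j in range(i + 1, n):
            dq, dy = q[i] - q[j], y[i] - y[j]
            ties_q += dq == 0
            ties_y += dy == 0
            if dq * dy > 0:
                concordant += 1
            elif dq * dy < 0:
                discordant += 1
    pairs = n * (n - 1) // 2
    denom = math.sqrt((pairs - ties_q) * (pairs - ties_y))
    return (concordant - discordant) / denom if denom > 0 else float("nan")


# Hypothetical scene: FIQM ranks 1..5, detection counts out of 8 observers.
fiqm = [1, 2, 3, 4, 5]
one_tie = [1, 3, 5, 5, 8]      # monotonic but not strictly monotonic
two_levels = [0, 0, 0, 0, 1]   # only two distinct detection values
constant = [8, 8, 8, 8, 8]     # same count on every display
```

With a constant detection vector both functions return NaN, which matches the NaN entries for Scenes 34 and 35.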

One can also observe that misordering in the more significant scenes causes lower DPMLR scores than misordering in the less significant scenes. For instance, we compare the scatter plots of Scenes 25 and 31 in Fig. 9 as well as the corresponding descending DPMLR values in Table IV. The descending DPMLR for Scene 25 is much smaller than that of Scene 31 because the DPMLR treats the misordering in Scene 31 as due to measurement noise. On the other hand, the other four correlations are equally unforgiving of the misordering regardless of the significance of the scene. This is because the correlations are invariant to linear scaling of the human detection results, whereas the DPMLR uses the binomial measurement model to determine whether or not the scale of the misordering is significant.

Once we recognize the DPMLR's ability to incorporate the significance of each scene into the statistical test, it is easy to see why the DPMLR provides a significantly higher score for the contrast measure. Comparing the scatter plots of the two measures in the first 20 scenes, we can see that seven of them, i.e., Scenes 3, 6, 8, 9, 14, 16, and 17, exhibit perfect monotonicity for the contrast measure and some misorderings for the fBm measure. In fact, the nature of the monotonic relationship for the fBm feature flips for Scene 14, i.e., it is perfectly decreasing. Both of these factors lead to the significantly higher DPMLR for the contrast measure. Because the contrast measure is still far from monotonically related to the perception results in many of the significant scenes, the composite DPMLR score is only slightly greater than one.

Next, we used all $N = 8$ fused image displays and ran another DPMLRT for color-based FIQMs, with the human detection results collected over $o = 8$ observers. The composite DPMLR scores of the 64 color-based FIQMs derived from the 16 automated grayscale FIQMs are low and are not included here for the sake of brevity. On the other hand, the color-based contrast measure achieved a composite DPMLR score of 1.4000, which is slightly greater than that of the contrast computed only over the five grayscale fused image displays. Because the number of fused images $N$ has increased, the significance of this "greater than one" score increases, and the p-value is 0.041972, which gives stronger support for the monotonic hypothesis. Certainly, the color-based contrast is able to incorporate the contrast from both the luminance and color components in an RGB image and serves as a potential FIQM that is able to explain some of the human performance. Again, the perfect FIQM would provide a composite DPMLR of 54.5150, and contrast is only one aspect of a good FIQM, which has yet to be identified.
V. Conclusion

This paper proposes the composite DPMLR to quantify how consistent the values of a FIQM are with measured human performance, represented by the probability of detection. Specifically, the DPMLR can be used to test whether or not a monotonic relationship exists between the FIQM and the underlying human detection performance that is measured via a perception experiment. The resulting test is designed to be applicable even when the number of observers is small, so that the measurement errors from the perceptual experiment are not necessarily Gaussian. The paper discusses some interesting properties of the DPMLR, and simulation results demonstrate the advantages of the DPMLR over other monotonic statistics. Unlike the MC in [17], the DPMLR seamlessly accounts for the spread of the human observations and the number of fused images. It indicates to what degree the ordering of the human observations by the FIQM is not due to random chance. The DPMLRT is a general test of monotonicity that can be used to evaluate monotonic relationships beyond the image fusion application. Finally, the DPMLR was used to score a number of potential FIQMs using real image data with a corresponding perception study.

The DPMLR scores reveal that a proper FIQM for the detection task is not yet available. The comparative measures may have scored poorly because the salient features exploited by these measures may not have captured the context in II imagery that humans exploit for detection. Of note, the contrast measure does demonstrate some utility based upon its DPMLR score, and is clearly one aspect that drives human detection performance. Future work is needed to identify a more meaningful FIQM. Such a measure may incorporate aspects of the contrast as well as other quality features of both the luminance and color components of the image. However, we expect that a measure needs to understand what context is available in the image, which makes the search for a good FIQM very challenging.

The paper revealed many interesting properties of the DPMLR and conjectured many more. Future work is necessary to prove (or disprove) these conjectured properties. In addition, one can study over what values of $\mathbf{p}$ the DPMLRT is the most powerful test.

The DPMLRT does incorporate some simplifying assumptions that could be relaxed for a more robust test. For instance, not all human observers are created equal, and the binomial distribution may not be the best model for the perception results. Furthermore, the values of $p_i$ are not independent, since all fusion algorithms attempt to provide a good image for human perception. The paper does demonstrate that the DPMLRT is robust as these model assumptions are relaxed. In addition, the DPMLRT assumes that the observers' probabilities of false alarm are calibrated, and it ignores the impact of contextual information, which may be known a priori or obtained in the image, on human detection performance. Future research can also focus on statistical scoring mechanisms that account for increasingly realistic data models.


Appendix

A. Proof of Property 1

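Proof: $\Lambda_N^{\uparrow}(\mathbf{y}, o)$ and $\Lambda_N^{\downarrow}(\mathbf{y}, o)$ can be expressed as

$$\Lambda_N^{\uparrow}(\mathbf{y}, o) = \frac{N! \int_{\mathcal{P}_1^{\uparrow}} \prod_{i=1}^{N} p_i^{y_i} (1 - p_i)^{o - y_i} \, d\mathbf{p}}{\int_{\mathcal{P}_0} \prod_{i=1}^{N} p_i^{y_i} (1 - p_i)^{o - y_i} \, d\mathbf{p}} \tag{24}$$

and

$$\Lambda_N^{\downarrow}(\mathbf{y}, o) = \frac{N! \int_{\mathcal{P}_1^{\downarrow}} \prod_{i=1}^{N} p_i^{y_i} (1 - p_i)^{o - y_i} \, d\mathbf{p}}{\int_{\mathcal{P}_0} \prod_{i=1}^{N} p_i^{y_i} (1 - p_i)^{o - y_i} \, d\mathbf{p}}. \tag{25}$$

Note that the integrands in the numerator and denominator are the same. This integrand is strictly positive for all $\mathbf{p}$ except on a set of measure zero, namely $\{\mathbf{p} : p_i \in \{0, 1\}\}$. Any integral of the integrand over $\mathcal{P}_0$, $\mathcal{P}_1^{\uparrow}$, $\mathcal{P}_1^{\downarrow}$, $\mathcal{P}_0 \setminus \mathcal{P}_1^{\uparrow}$, and $\mathcal{P}_0 \setminus \mathcal{P}_1^{\downarrow}$ must be strictly positive. Thus, the integrals in the numerators of (24) and (25) are strictly less than the integrals in the denominators. Furthermore, all the integrals are strictly positive. Thus

$$0 < \frac{\int_{\mathcal{P}_1^{\uparrow}} \prod_{i=1}^{N} p_i^{y_i} (1 - p_i)^{o - y_i} \, d\mathbf{p}}{\int_{\mathcal{P}_0} \prod_{i=1}^{N} p_i^{y_i} (1 - p_i)^{o - y_i} \, d\mathbf{p}} < 1 \tag{26}$$

$$0 < \frac{\int_{\mathcal{P}_1^{\downarrow}} \prod_{i=1}^{N} p_i^{y_i} (1 - p_i)^{o - y_i} \, d\mathbf{p}}{\int_{\mathcal{P}_0} \prod_{i=1}^{N} p_i^{y_i} (1 - p_i)^{o - y_i} \, d\mathbf{p}} < 1. \tag{27}$$

Multiplication by $N!$ leads to $0 < \Lambda_N^{\uparrow}(\mathbf{y}, o) < N!$ and $0 < \Lambda_N^{\downarrow}(\mathbf{y}, o) < N!$. Because the ascending and descending DPMLRs are bounded by zero and $N!$ for each scene, it is clear by (11) that the composite DPMLR is also bounded by 0 and $N!$. ■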
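The ratio in (24) can be approximated numerically. The following is a minimal Monte Carlo sketch rather than the authors' procedure. It assumes that $\mathcal{P}_0$ is the unit hypercube $[0, 1]^N$, so that the integrand divided by its integral over $\mathcal{P}_0$ is a product of independent Beta($y_i + 1$, $o - y_i + 1$) densities. Under that assumption the ascending DPMLR equals $N!$ times the probability that independent draws from these Beta densities come out in ascending order. The function name `dpmlr_mc` and its parameters are hypothetical.

```python
import math
import random


def dpmlr_mc(y, observers, ascending=True, n_samples=50_000, seed=0):
    """Monte Carlo estimate of the ascending (or descending) DPMLR.

    y          detection counts, listed in order of ascending FIQM value
    observers  number of observers o, so that 0 <= y[i] <= observers

    Assumes the null region is the unit hypercube, so each p_i can be drawn
    independently from a Beta(y_i + 1, o - y_i + 1) density.
    """
    rng = random.Random(seed)
    n = len(y)
    hits = 0
    for _ in range(n_samples):
        p = [rng.betavariate(yi + 1, observers - yi + 1) for yi in y]
        if not ascending:
            p.reverse()  # descending order of p is ascending order of reversed p
        if all(p[i] <= p[i + 1] for i in range(n - 1)):
            hits += 1
    # N! times the estimated probability that the draws are ordered.
    return math.factorial(n) * hits / n_samples
```

Because the estimate is $N!$ times a probability, it cannot leave the interval $[0, N!]$, which is consistent with the bounds of Property 1. For $N = 2$, $o = 1$, and $\mathbf{y} = (0, 1)$, the integrals can be evaluated by hand and give an ascending value of $5/3$ and a descending value of $1/3$, which the estimate should approach as the number of samples grows.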


B.  Proof  of  Property  3 

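Proof: Let $\pi_k : \{1, 2, \ldots, N\} \mapsto \{1, 2, \ldots, N\}$ be a permutation mapping such that $\pi_k(i) \neq \pi_k(j)$ when $i \neq j$. There are $N!$ such mappings, and let each mapping be identified with a unique index $k$, where $k = 1, 2, \ldots, N!$. As a matter of convention, $k = 1$ is the identity mapping, i.e., $\pi_1(i) = i$, and $k = N!$ is the reverse sort, i.e., $\pi_{N!}(i) = N + 1 - i$. Each permutation function allows one to define an ordering of the coordinates, i.e.,

$$\mathcal{R}_k = \{\mathbf{p} : 0 \le p_{\pi_k(1)} \le p_{\pi_k(2)} \le \cdots \le p_{\pi_k(N)} \le 1\} \tag{28}$$

such that the collection of all $N!$ orderings defines any possible sequence of coordinate values, i.e.,

$$\mathcal{P}_0 = \bigcup_{k=1}^{N!} \mathcal{R}_k. \tag{29}$$

Furthermore, $\mathcal{P}_1^{\uparrow} = \mathcal{R}_1$ and $\mathcal{P}_1^{\downarrow} = \mathcal{R}_{N!}$. As a result

$$\int_{\mathcal{P}_0} \prod_{i=1}^{N} p_i^{y_i} (1 - p_i)^{o - y_i} \, d\mathbf{p} = \sum_{k=1}^{N!} \int_{\mathcal{R}_k} \prod_{i=1}^{N} p_i^{y_i} (1 - p_i)^{o - y_i} \, d\mathbf{p}. \tag{30}$$

Using the change of variable $p_{\pi_k(i)} \to p_i$, the right-hand side of (30) can be rewritten as

$$\int_{\mathcal{P}_0} \prod_{i=1}^{N} p_i^{y_i} (1 - p_i)^{o - y_i} \, d\mathbf{p} = \sum_{k=1}^{N!} \int_{\mathcal{R}_1} \prod_{i=1}^{N} p_i^{y_{\pi_k(i)}} (1 - p_i)^{o - y_{\pi_k(i)}} \, d\mathbf{p}. \tag{31}$$

If $y_1 = y_2 = \cdots = y_N$, then we have $y_{\pi_1(i)} = y_{\pi_2(i)} = \cdots = y_{\pi_{N!}(i)} = y_i$ for $i = 1, 2, \ldots, N$. It follows that

$$\int_{\mathcal{P}_0} \prod_{i=1}^{N} p_i^{y_i} (1 - p_i)^{o - y_i} \, d\mathbf{p} = N! \int_{\mathcal{R}_1} \prod_{i=1}^{N} p_i^{y_i} (1 - p_i)^{o - y_i} \, d\mathbf{p}. \tag{32}$$

According to (24) and (25), we have $\Lambda_N^{\uparrow}(\mathbf{y}, o) = 1 = \Lambda_N^{\downarrow}(\mathbf{y}, o)$. ■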
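As a quick check of Property 3 in the smallest nontrivial case, take $N = 2$, $o = 1$, and $y_1 = y_2 = 0$. The worked example below is an illustration, not part of the original proof, and it assumes that the null region is the unit square $[0, 1]^2$ and that the ascending region is $\{p_1 \le p_2\}$.

$$\int_{[0,1]^2} (1 - p_1)(1 - p_2) \, d\mathbf{p} = \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}$$

$$\int_{p_1 \le p_2} (1 - p_1)(1 - p_2) \, d\mathbf{p} = \int_0^1 (1 - p_2) \left( p_2 - \frac{p_2^2}{2} \right) dp_2 = \frac{1}{2} - \frac{1}{2} + \frac{1}{8} = \frac{1}{8}$$

$$\Lambda_2^{\uparrow}(\mathbf{y}, o) = \frac{2! \cdot \frac{1}{8}}{\frac{1}{4}} = 1$$

By the symmetry of the integrand in $p_1$ and $p_2$, the region $\{p_2 \le p_1\}$ contributes the same $1/8$, so the descending DPMLR is also one.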


C.  Proof  of  Properties  4  and  5 

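Proof: First we want to show that if the $y_i$'s are in ascending order and not constant, then

$$\prod_{i=1}^{N} p_i^{y_i} (1 - p_i)^{o - y_i} > \prod_{i=1}^{N} p_i^{y_{\pi_k(i)}} (1 - p_i)^{o - y_{\pi_k(i)}} \tag{33}$$

when $k \neq 1$.

To this end, we transform the permutation $\pi_k$ back to the identity. Define the permutation function $g_0(i) = \pi_k(i)$. For the first step, the value of $g_0(1)$ is switched with the value $g_0(j)$, where $g_0(j) = 1$, to form $g_1$. The process repeats itself for $N - 1$ steps such that, for the $n$th step, the value of $g_{n-1}(n)$ is switched with $g_{n-1}(j)$, where $g_{n-1}(j) = n$, to form $g_n$. Formally, at the $n$th step we have $g_n(n) = n$, $g_n(j) = g_{n-1}(n)$, and $g_n(i) = g_{n-1}(i)$, where $j = g_{n-1}^{-1}(n)$, $i > n$, and $i \neq j$. Note that $j \ge n$ and $g_{n-1}(n) \ge n$ because $g_{n-1}(i) = i$ for $i = 1, 2, \ldots, n - 1$. After $N - 1$ steps, $\pi_1 = g_{N-1}$. After the $n$th step, the ratio of the likelihoods associated with permutations $g_n$ and $g_{n-1}$, i.e.,

$$\frac{\prod_{i=1}^{N} p_i^{y_{g_n(i)}} (1 - p_i)^{o - y_{g_n(i)}}}{\prod_{i=1}^{N} p_i^{y_{g_{n-1}(i)}} (1 - p_i)^{o - y_{g_{n-1}(i)}}} = \left( \frac{p_n (1 - p_j)}{p_j (1 - p_n)} \right)^{y_n - y_{g_{n-1}(n)}} \tag{34}$$

is greater than or equal to unity because $y_n \le y_{g_{n-1}(n)}$ and $p_n \le p_j$ over $\mathcal{P}_1^{\uparrow}$. By taking the product of (34) for $n = 1, 2, \ldots, N - 1$, we have

$$\frac{\prod_{i=1}^{N} p_i^{y_{g_{N-1}(i)}} (1 - p_i)^{o - y_{g_{N-1}(i)}}}{\prod_{i=1}^{N} p_i^{y_{g_0(i)}} (1 - p_i)^{o - y_{g_0(i)}}} \ge 1. \tag{35}$$

The equality occurs only if $y_n = y_{g_{n-1}(n)}$ for $n = 1, 2, \ldots, N - 1$, which means the $y_i$'s are equal. Because $g_0 = \pi_k$, $g_{N-1}$ is the identity map, and the $y_i$'s are not constant, (33) is proven.

Now, integrating both sides of (33) over $\mathcal{P}_1^{\uparrow} = \mathcal{R}_1$ leads to

$$\int_{\mathcal{R}_1} \prod_{i=1}^{N} p_i^{y_i} (1 - p_i)^{o - y_i} \, d\mathbf{p} > \int_{\mathcal{R}_1} \prod_{i=1}^{N} p_i^{y_{\pi_k(i)}} (1 - p_i)^{o - y_{\pi_k(i)}} \, d\mathbf{p} \tag{36}$$

when $k \neq 1$. Similarly, one can show that

$$\int_{\mathcal{R}_1} \prod_{i=1}^{N} p_i^{y_{\pi_{N!}(i)}} (1 - p_i)^{o - y_{\pi_{N!}(i)}} \, d\mathbf{p} < \int_{\mathcal{R}_1} \prod_{i=1}^{N} p_i^{y_{\pi_k(i)}} (1 - p_i)^{o - y_{\pi_k(i)}} \, d\mathbf{p} \tag{37}$$

when $k \neq N!$. The division of (36) and (37) by $\prod_{i=1}^{N} \beta(y_i + 1, o - y_i + 1)$ leads to the first statement in Property 4. Similar arguments prove the second statement in Property 4.

Summing (36) for $k = 1, \ldots, N!$ and using (31) leads to

$$\int_{\mathcal{P}_0} \prod_{i=1}^{N} p_i^{y_i} (1 - p_i)^{o - y_i} \, d\mathbf{p} < N! \int_{\mathcal{R}_1} \prod_{i=1}^{N} p_i^{y_i} (1 - p_i)^{o - y_i} \, d\mathbf{p}. \tag{38}$$

Then $\Lambda_N^{\uparrow}(\mathbf{y}, o) > 1$. Similarly, (37) can be reexpressed as

$$\int_{\mathcal{P}_0} \prod_{i=1}^{N} p_i^{y_i} (1 - p_i)^{o - y_i} \, d\mathbf{p} > N! \int_{\mathcal{R}_{N!}} \prod_{i=1}^{N} p_i^{y_i} (1 - p_i)^{o - y_i} \, d\mathbf{p}$$

so that $\Lambda_N^{\downarrow}(\mathbf{y}, o) < 1$. This completes the proof of the first statement in Property 5. The second statement can be proven by similar arguments. ■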
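The swap construction used in this proof can be traced on a small example. The sketch below is illustrative only: the helper names `swap_steps` and `likelihood` are hypothetical, permutations are stored as tuples of 1-based values, and the sample values of $\mathbf{y}$ and $\mathbf{p}$ are made up. It builds $g_1, \ldots, g_{N-1}$ from a starting permutation $g_0$ by the switching rule described in the proof, and evaluates the product $\prod_i p_i^{y_{g(i)}} (1 - p_i)^{o - y_{g(i)}}$ at each step.

```python
import math


def swap_steps(perm):
    """Return [g_1, ..., g_{N-1}] built from g_0 = perm (1-based values).

    At step n, the value at position n is switched with the value at the
    position j that currently holds n, so that g_n(n) = n afterwards.
    """
    g = list(perm)
    steps = []
    for n in range(1, len(g)):
        j = g.index(n)  # 0-based position whose value is n
        g[n - 1], g[j] = g[j], g[n - 1]
        steps.append(tuple(g))
    return steps


def likelihood(p, y, observers, g):
    """Product over i of p_i^{y_{g(i)}} * (1 - p_i)^{o - y_{g(i)}}."""
    return math.prod(
        pi ** y[gi - 1] * (1 - pi) ** (observers - y[gi - 1])
        for pi, gi in zip(p, g)
    )


# Hypothetical example: ascending detections and ascending probabilities.
y = [1, 4, 6]
p = [0.2, 0.5, 0.9]
g0 = (3, 1, 2)
chain = [g0] + swap_steps(g0)  # ends at the identity (1, 2, 3)
values = [likelihood(p, y, 8, g) for g in chain]
```

For ascending `y` and ascending `p`, each entry of `values` is at least as large as the one before it, which is the step-by-step content of the ratio argument leading to the product inequality.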